Abstract


Estimating the motion state of objects is a central component of most visual 
tracking pipelines. Here, object observations provided by an appearance
model, representing the object in image space, serve as input for the actual 
filtering and the prediction into future frames. Under real-life conditions, the 
dynamics of tracked objects are subject to change over time. Especially in 
such maneuver scenarios, current methods struggle to deal with the model 
mismatch due to varying system characteristics. 


This thesis addresses the problem of how to capture the dynamics of maneu- 
vering objects in an efficient and reactive way. Towards this end, the per- 
spective of recursive Bayesian filters and the perspective of deep learning ap- 
proaches on state estimation are considered and their functional viewpoints 
are brought together. 


The starting point of this thesis is the interacting multiple-model (IMM)
filter, the most common Bayesian formulation for dealing with model
mismatches, that is, with maneuvering objects. For a model mismatch
scenario, in which tracking is done directly in image space, a state de-coupling 
and a re-coupling scheme are introduced as modifications for an improved 
design compared to the standard IMM filter. 


In order to deal with two maneuver types, switching noise levels and switch- 
ing dynamics, recurrent neural network (RNN)-based approaches are pro- 
posed as alternatives to IMM filtering. The approaches maintain the func- 
tionality of an IMM filter while reducing the amount of required filter tuning. 
With a focus on applications in the surveillance and intelligent vehicle do- 
mains, the effectiveness of RNN-based solutions is demonstrated for the ex- 
emplary tasks of path prediction and intention prediction, reflecting the most 
common prototypical maneuver types. The presented RNN-based network
yields performance comparable to other existing relevant methods on a pub- 
lic benchmark. The suggested modifications help to achieve a robust predic- 
tion performance with regard to switching noise levels. For sudden motion 
changes, a proposed RNN-based IMM surrogate can capture the change in 
the dynamical behavior more reliably than its Bayesian filter counterparts.
The abilities of the RNN-IMM are evaluated in extensive experiments on real- 
world and synthetic datasets, reflecting prototypical maneuver situations of 
pedestrians in the application domain of intelligent vehicles. 
Kurzfassung (German Summary)


Estimating the motion state of objects is a central component of video-based
object tracking. Object observations, which are provided by an appearance
model and represent the object in image space, serve as input for the
filtering and the prediction into future frames. Under real conditions, the
dynamics of the tracked object vary over time. Especially in such maneuver
situations, current methods have difficulty estimating the motion state of
the object because of model mismatches caused by the varying system
characteristics.


This thesis addresses the problem of capturing the dynamics of maneuvering
objects efficiently and reactively. To this end, the perspective of recursive
Bayesian filters and the perspective of deep learning approaches on state
estimation are considered and their functional viewpoints are brought
together.


The starting point of this thesis is the interacting multiple-model (IMM)
filter, one of the most frequently used approaches based on a Bayesian
formulation for dealing with model mismatches, that is, with maneuvering
objects. For a model mismatch scenario in which object tracking is done
directly in image space, a state de-coupling and a re-coupling scheme are
introduced as modifications for an improved design compared to the standard
IMM filter. To better handle the two maneuver types of varying noise levels
and varying object dynamics, recurrent neural network (RNN)-based approaches
are presented as an alternative to the IMM filter. The approaches reproduce
the functionality of an IMM filter while reducing the amount of required
filter tuning.


With a focus on applications in video surveillance and intelligent vehicles,
the effectiveness of the presented RNN-based approaches is demonstrated for
the exemplary tasks of path prediction and intention prediction. The selected
applications reflect prototypical maneuver situations. A presented RNN-based
network achieves a performance comparable to relevant state-of-the-art
methods on a public benchmark. The proposed modifications help to achieve a
robust prediction performance with respect to the noise levels. For sudden
motion changes, a proposed RNN-based IMM surrogate captures the change in the
dynamical behavior more reliably than its Bayesian filter counterparts. The
capabilities of the RNN-IMM are evaluated in extensive experiments on real
and synthetic datasets that reflect prototypical maneuver situations of
pedestrians in the application domain of intelligent vehicles.
Notation


This chapter introduces the notation and symbols which are used in this thesis. 


General notation 


Scalars        italic Roman and Greek lowercase letters       $x, \alpha$
Sets           calligraphic Roman uppercase letters           $\mathcal{D}$
Vectors        bold Roman lowercase letters                   $\mathbf{t}$
Matrices       bold Roman uppercase letters                   $\mathbf{R}$
State spaces   bold calligraphic Roman uppercase letters      $\boldsymbol{\mathcal{X}}$


In multidimensional sets of elements related to time series, the first super- 
script index denotes time. 


Distributions


$\mathcal{N}$          Gaussian distribution
Bin                    Binomial distribution


Numbers, indexing and conventions


$\mathbb{N}$           natural numbers
$\mathbb{R}$           real numbers
$k, t$                 discrete points in time
$i, j, \ell, q$        indexing for objects, observations and points
$\lceil\cdot\rceil$    ceil operator, the least integer greater than or equal
                       to the value


State modeling and probabilities


$\boldsymbol{\mathcal{X}}$    (dynamical) state-space
$\boldsymbol{\mathcal{H}}$    (recurrent) state-space
$\boldsymbol{\mathcal{Z}}$    observation space
$\boldsymbol{\mathcal{Y}}$    target space
$f_\Theta$                    dynamical model
$h_{\mathrm{obs}}(\cdot)$     observation model
$\mathbf{F}$                  system matrix of the Kalman Filter
$\mathbf{G}$                  noise gain matrix of the Kalman Filter
$\mathbf{H}$                  observation matrix of the Kalman Filter
$\mathbf{K}$                  Kalman gain
$\mathbb{E}[\cdot]$           expectation value
$\mathbf{x}^k$                (dynamical) state vector at time k
$\mathbf{h}^k$                (recurrent) state vector at time k
$\mathbf{z}^k$                observation vector at time k
$\mathbf{y}^k$                target vector at time k
$m^k$                         dynamical mode at time k
$\mathbf{v}^k$                process noise at time k
$\mathbf{w}^k$                observation noise at time k
$\mathbf{Q}^k$                process noise covariance matrix at time k

$\mathbf{R}^k$                observation noise covariance matrix at time k
$\mathbf{P}$                  covariance matrix
pk. (dynamical) state covariance matrix 
pk observation covariance matrix 
k,— 5 "Te 
x prior probability 
Et posterior probability 
$p(\mathbf{x})$                                   probability density function (pdf)
$P(m^k)$                                          probability mass function (pmf)
$p(\mathbf{x}^{k+1} \mid \mathbf{x}^k, \ldots)$   transition density
$P(m^{k+1} \mid m^k, \ldots)$                     transition probability
1 Introduction


One fundamental ability essential for intelligent autonomous systems to see, 
understand, and react to the environment is to track objects of interest in im- 
age sequences. Its applications cover a broad range from intelligent vehicles 
to robot navigation and smart video surveillance. For example, the ability to 
anticipate the actions of pedestrians in a scene and to predict their future po- 
sitions is a safety issue for autonomous vehicles and other vision-based active 
safety systems. 


Figure 1.1: Scenes captured from an approaching vehicle, the most important question being 
whether the pedestrian is going to cross the street. Traditionally, such questions are 
tackled with adaptive recursive Bayesian filters [Sch13]. 


Despite enormous advances in extracting observations of objects from im- 
ages due to deep learning, the actual filtering and the prediction into future 
frames are mainly restricted to the application of recursive Bayesian filters. 
The problem-specific choice of their design parameters, such as connecting 
the object motion uncertainty and its predictability to physical system pa- 
rameters or the observation uncertainty, requires not only well-suited phys- 
ical models, but also a large amount of engineering. Especially in situations 
where the tracked objects perform a maneuver, this has proved a challenging
task. A maneuver is any motion of an object that deviates from the dynamical
model used by the filter. An illustrative example of such a variation in the
dynamics, to which filters struggle to adapt, is a pedestrian who may or may
not be about to cross the street (see figure 1.1). Such situations
additionally require the choice of several adequate dynamical models,
including the associated transition modeling.


Towards this end, the overarching research question of this thesis is how to 
capture the dynamics of maneuvering objects in image sequences effectively. 
Maneuvering objects are either subject to random perturbations, i.e.,
different noise levels, or subject to sudden motion changes.


The dynamical model acts as one component of a larger vision system whose 
tasks mainly consist of providing additional information for further process- 
ing steps, supporting appearance models by bridging detection failures, and 
forecasting the behavior by predicting future states. 


1.1 Problem Statement 


Given a short sequence of observations $\mathcal{Z}$ generated by an
appearance model of a visual tracker, we are interested in estimating the
state of a maneuvering object. In the following, we consider systems whose
discrete-time motion or dynamical model can be formalized as


$$\mathcal{Y} = f_\Theta(\mathcal{Z}, \mathcal{C}) + \epsilon.$$    (1.1)


The aim is to estimate the expected conditioned states of the object


$$\mathbb{E}[\mathcal{Y} \mid \mathcal{Z}, \mathcal{C}].$$    (1.2)


Here, $\mathcal{Y}$ describes the states or state distributions of a tracked
object, $\mathcal{C}$ describes additional contextual cues extracted from the
observed image sequences, and $\epsilon$ describes an additional error term.
In Bayesian filtering,
models of this type are called state-space models or dynamical systems,
whereas in deep learning, they are referred to as recurrent neural networks.
This thesis includes discussions on both formulations and their connection 
by maximum likelihood inference. In order to effectively capture the dynamics
of maneuvering objects and to reduce the amount of engineering, a deep
learning solution comparable to adaptive recursive Bayesian filtering is
introduced. The research questions in this context are answered along the two
prototypical types of maneuvers: abrupt motion changes and random
perturbations. The main application areas throughout this thesis are
intelligent vehicles and automated surveillance systems.


1.2 Contributions


This thesis starts from adaptive filters and their most common
representative, the interacting multiple-model (IMM) filter [Blo88]. Based 
on a Bayesian formulation, the IMM filter is designed for capturing motion 
uncertainties and modeling complex object dynamics in situations where the 
object undergoes sudden changes. The IMM filter can be used to combine sev- 
eral dynamical models and offers a good compromise between performance 
and complexity. This thesis contributes to an improved design of a basic IMM 
filter as a module in a visual tracking pipeline by introducing both a state 
de-coupling and a re-coupling scheme as modifications. 


De-coupling: Firstly, when relying solely on visual cues, the benefit of a 
suggested de-coupling of the state estimate of an IMM filter is demonstrated 
[Bec16]. 


Re-coupling: Secondly, a state re-coupling scheme is introduced which helps 
to better deal with the corresponding observation uncertainties of such a 
tracking pipeline [Bec18a]. 


Although the IMM filter has some drawbacks, it is still a core element for many 
state-of-the-art applications. In order to reduce the amount of required engi- 
neering and to learn an improved dynamical model structure, a contribution 
of this thesis is the transfer of the IMM functionality into a comparable deep 
learning architecture. Since adaptive filters and in particular the IMM filter 
are designed to deal with maneuvering objects, two major maneuver types are
considered separately in the following.


Switching noise levels: The effectiveness of deep neural networks for
predicting future pedestrian states is evaluated. A proposed network achieves
state-of-the-art performance on publicly available datasets [Bec18c]. The
results can be accessed on the TrajNet website (http://trajnet.stanford.edu/,
last accessed 19.12.2019). The ranking of different predictors combines the
final displacement error and the average displacement error for predicting
the next 12 states of pedestrian trajectories pooled over the datasets. The
proposed network is an RNN-encoder with a dense layer on top for projecting
into the observation space. Although simple at its core, the network achieves
a performance comparable to more elaborate models that consider more cues
than position information alone.
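
The two error measures named above can be illustrated with a minimal Python
sketch. It assumes that a path is a list of 2D positions and that both errors
are Euclidean distances; the function name `displacement_errors` and the toy
data are hypothetical, and the way the benchmark combines the two values into
a ranking is not reproduced here.

```python
import math


def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) for two equally long lists of (x, y) positions."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("paths must be non-empty and of equal length")
    # Euclidean distance between prediction and ground truth per time step
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    ade = sum(distances) / len(distances)  # average displacement error
    fde = distances[-1]                    # final displacement error
    return ade, fde


# Toy example with 12 predicted steps: the prediction drifts sideways
# by 0.05 units per step relative to a straight ground-truth path.
truth = [(0.4 * k, 0.0) for k in range(1, 13)]
pred = [(0.4 * k, 0.05 * k) for k in range(1, 13)]
ade, fde = displacement_errors(pred, truth)
```

In this toy case the final error is larger than the average error because the
drift accumulates over the prediction horizon.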


Switching behavior: The connections to the IMM filter are explored and an IMM
filter surrogate is presented (RNN-IMM). Similar to an IMM filter solution,
the presented RNN-IMM assigns a probability value to different dynamical
modes and, based on them, generates a multi-modal distribution over future
object states as output [Bec19b, Bec19a]. The switching behavior is
thoroughly analyzed for prototypical, critical maneuver situations, such as a
bending-in maneuver of pedestrians. The presented RNN-IMM solution not only
reduces the amount of explicit modeling of filter parameters but also enables
an improved maneuver onset and maneuver termination behavior.
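
What "a probability value per dynamical mode" and "a multi-modal distribution
over future object states" can look like is sketched below for a single
spatial dimension. This is only an illustration under stated assumptions, not
the RNN-IMM architecture itself: the mode scores, the softmax normalization,
the two Gaussian components, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn unnormalized mode scores into mode probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_pdf(x, weights, means, sigmas):
    """Density of a 1D Gaussian mixture with one component per mode."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, sigmas)
    )


def mixture_moments(weights, means, sigmas):
    """Overall mean and variance of the mixture (moment matching)."""
    mean = sum(w * m for w, m in zip(weights, means))
    var = sum(w * (s * s + (m - mean) ** 2)
              for w, m, s in zip(weights, means, sigmas))
    return mean, var


# Two assumed modes, e.g. "keeps walking" and "stops", with predicted
# future positions 2.0 and 0.0 along one axis.
mode_probs = softmax([2.0, 0.0])
means, sigmas = [2.0, 0.0], [1.0, 1.0]
mean, var = mixture_moments(mode_probs, means, sigmas)
density_at_mean = mixture_pdf(mean, mode_probs, means, sigmas)
```

The moment-matched mean and variance correspond to the kind of combined
estimate an IMM filter reports, while the mixture density keeps both modes
visible.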


In order to provide a learned reference trajectory for pooled object trajectory 
data, this thesis contributes by introducing an alignment network. 


Alignment network: The application of hard-coded normalization strate- 
gies on pooled trajectories shifts the variation along the trajectory. Hence, 
the arbitrarily chosen references hinder applying clustering approaches. The 
proposed network learns a freely adjustable prototype as a reference trajec- 
tory. Firstly, the resulting prototype reflects the minimum variance of the 
input trajectories, which allows deducing the dominating dynamical behav- 
ior. Secondly, with a fixed reference, the conditions for clustering approaches 
and out-of-distribution detections are improved. 




Overall, this thesis is motivated by uniting the interconnected Bayesian and 
deep learning perspectives on maneuver prediction. In response to the
research question of how to effectively capture changing object dynamics in
image sequences, a transfer of the IMM filter functionality to comparable
neural networks is introduced.


1.3 Outline


The thesis is structured as follows: Chapter 2 introduces the theoretical back- 
ground in order to unite the functional views of deep learning and Bayesian 
filtering on object tracking. Furthermore, the current state of the art is surveyed
for selected exemplary applications. In chapter 3, the problem of maneuvering 
object tracking is considered from the Bayesian filtering perspective, result- 
ing in improved IMM filter designs. Chapter 4 presents an alternative deep 
learning-based solution in order to reduce the amount of hand-tuning of the 
filters and to provide an effective solution for the switching state problem. 
Conclusions are drawn in chapter 5. 


2 Perspectives on State Estimation 
from Visual Observations 


In this chapter, the perspectives of deep learning and recursive Bayesian fil- 
tering on (visual) object tracking are united. Based on the united functional 
view, the contributions of this thesis are positioned with respect to existing 
literature and to specific applications. 


2.1 What is Visual Tracking? 


Vision-based or visual tracking is defined as the process of using image
observations and a predictive dynamical model to consistently estimate the
state(s) of one or more object(s) over the discrete-time steps corresponding
to video frames [Mag11]. Here, recursive Bayesian filtering mostly acts as a
top-down process for state estimation, which involves incorporating prior
information about the scene or object to connect the object dynamics to
physical systems [Bla03]. A tracking pipeline with top-down filtering is
often referred to as detection-by-tracking [And08], and one without top-down
filtering as tracking-by-detection. A block diagram of a single-object visual
tracking pipeline is shown in figure 2.1.


Vision-based tracking of a single object is formulated as the estimation of a
time series $\mathcal{Z} = \{\mathbf{z}^k : k \in \mathbb{N}\}$ over a set of
discrete-time instances $k$, based on the information
$\mathcal{I} = \{\mathbf{I}^k : k \in \mathbb{N}\}$ from the set of images.
The vector-valued time series $\mathcal{Z}$ is considered as the states of
the object and is mainly referred to as the trajectory of the object.
However, for recursive Bayesian filters or dynamical systems*, the term state
$\mathcal{X} = \{\mathbf{x}^k : k \in \mathbb{N}\}$ refers to a collection of
variables such as position, velocity, and orientation, which are indirectly
observed through noisy observations.


Figure 2.1: Block diagram of a visual tracking pipeline which shows the main components of a
tracking cycle (blocks: object initialization, visual representation, statistical learning,
localization, state estimation).

Since the focus of this thesis is neither on building an appearance model for
object detection nor on the necessary feature extraction, but on the top-down
state estimation, it is crucial to clearly distinguish between the different
uses of the term state. The term observation $\mathbf{z}^k$ will be used to
describe the object representation (e.g., bounding box, centroid, blobs) in
the image generated by an appearance model of a detector or a visual tracker.
Appearance modeling thus basically boils down to representing object pixel
intensities. The resulting associated image region is the observation serving
as input for the dynamical state estimation. Pedestrian detections in the
form of enclosing bounding boxes are an illustrative example of an
observation space $\boldsymbol{\mathcal{Z}}$. Observation, detection, and
object state interchangeably refer to the shape approximation in the image,
in contrast to the dynamical state $\mathbf{x}^k$, which fully describes a
dynamical system.


In figure 2.2, commonly used object representations for describing the
location and an approximation of the object shape are depicted. For the goal
of tracking an object in the 2D image space, the minimal form of
$\mathbf{z}^k$ is the center position of the object in $\mathbf{I}^k$.
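
As a small illustration of this minimal observation, the following Python
sketch reduces bounding-box detections to center positions. The box layout
(left, top, width, height) in pixels and the function name `bbox_center` are
assumptions made for this example only; detectors differ in their box
conventions.

```python
def bbox_center(box):
    """Center (u, v) of an axis-aligned box (left, top, width, height)."""
    left, top, width, height = box
    return (left + width / 2.0, top + height / 2.0)


# Three hypothetical pedestrian detections from consecutive frames
detections = [(100, 50, 40, 80), (104, 50, 40, 80), (108, 51, 40, 80)]
observations = [bbox_center(box) for box in detections]
```

The resulting list plays the role of the observation sequence that a
dynamical state estimator receives as input.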


* The terms state-space models and dynamical systems are used interchangeably in this thesis. 
Whereas the term state-space models originates from probabilistic modeling, the term dynam- 
ical systems originates from signal processing. Bayesian filtering refers to the Bayesian way of 
formulating optimal filtering for dynamical systems. 




Figure 2.2: Examples of object states for different visual tracking tasks. 


Deep-tracking based approaches are mostly tailored to image processing tasks 
such as classification and detection, thus detection-by-tracking with a Bayes- 
ian filter is still very common [Kre17]. The tasks of the Bayesian filter within 
the overall pipeline are: 


* Support the appearance model in bridging detection failures or
occlusion situations.
* Provide additional information for subsequent processing stages.
* Enhance the detection robustness.
* Estimate quantities that are only indirectly observable.
* Forecast the behavior of the object.


Within the pipeline, the tasks of the Bayesian filter can be explicitly associated
with different types of inference problems: prediction, filtering, and smooth- 
ing. Because inference is a very general problem for machine learning models, 
the consideration of the filter functionality as an inference problem helps to 
unite the functional viewpoints. Furthermore, the types of required compu- 
tations are neatly separated in order to reason from sequential data correctly 
[Moh15]. 
2.2 One Problem - Two Functional Views


Recursive Bayesian filtering refers to the Bayesian way of formulating the
estimation of the hidden (dynamical) states using probability theory. Hence,
the hidden dynamical states and the observations are assumed to be random
variables. The dynamical state itself is represented by means of a
probability density function (pdf)† $p(\mathbf{x}^k)$ at time step $k$. It is
assumed that the transition density‡ $p(\mathbf{x}^{k+1} \mid \mathbf{x}^k)$
is the same for all time instances and behaves according to a known system
transition function. This function is referred to as the dynamical model (see
equation 1.1). Other commonly used terms include, among others, motion model,
process model, and plant model. For recursive Bayesian filtering, the
dynamical model can be written as


$$\mathbf{x}^{k+1} = f^k(\mathbf{x}^k, \mathbf{v}^k).$$    (2.1)


Here, $f^k(\cdot)$ is a non-linear function and $\mathbf{v}^k$ the process
noise. In the remainder of this thesis, only discrete-time models are
considered because the observations are available only at discrete-time
instants. The time steps are related through $t_{k+1} = t_k + \Delta T$,
where $\Delta T$ is the sampling time. The dynamical state $\mathbf{x}$ is
assumed to be an unobserved Markov process, and $\mathbf{z}$ are the
observations of a hidden Markov model (HMM). The observation or measurement
model maps the hidden dynamical state into the observation space and is given
by


$$\mathbf{z}^k = h^k_{\mathrm{obs}}(\mathbf{x}^k, \mathbf{w}^k),$$    (2.2)


with the non-linear observation function $h^k_{\mathrm{obs}}(\cdot)$ and
observation noise $\mathbf{w}^k$. A graphical model, which expresses the
conditional dependence structure of such an HMM, is depicted in figure 2.3
(see for example [Kol09]). The structure of the model complies with a
directed acyclic graph representing the factorization of the joint
probability.
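
To make the roles of the dynamical model and the observation model concrete,
the following Python sketch simulates a linear special case for one image
axis. The nearly-constant-velocity model with system matrix F = [[1, dt],
[0, 1]], noise gain G = [dt^2 / 2, dt], and observation matrix H = [1, 0] is
an assumption chosen for illustration; the function name, the noise levels,
and the initial state are hypothetical.

```python
import random


def simulate_constant_velocity(steps, dt=1.0, sigma_v=0.1, sigma_w=0.5, seed=0):
    """Simulate x^{k+1} = F x^k + G v^k and z^k = H x^k + w^k (one axis)."""
    rng = random.Random(seed)
    pos, vel = 0.0, 1.0  # assumed initial dynamical state (position, velocity)
    states, observations = [], []
    for _ in range(steps):
        states.append((pos, vel))
        # Observation model: H = [1, 0], only the position is observed
        observations.append(pos + rng.gauss(0.0, sigma_w))
        # Dynamical model: F = [[1, dt], [0, 1]], G = [dt^2 / 2, dt]
        v = rng.gauss(0.0, sigma_v)  # process noise (random acceleration)
        pos, vel = pos + dt * vel + 0.5 * dt * dt * v, vel + dt * v
    return states, observations


states, observations = simulate_constant_velocity(steps=20)
```

The velocity never appears in `observations`; it is a hidden part of the
dynamical state that has to be inferred from the noisy positions.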


† In the case of a discrete state-space, a probability mass function (pmf) $P(\mathbf{x}^k)$.
‡ In the case of a discrete state-space, a transition probability $P(\mathbf{x}^{k+1} \mid \mathbf{x}^k)$.


As mentioned above, the conditional density
$p(\mathbf{x}^{k+1} \mid \mathbf{x}^k)$, which depends on the dynamical
model, is assumed to be stationary. This is equivalent to assuming that the
parameters of the transition function are shared across time steps [Moh15].


Figure 2.3: A graphical model specifying the conditional relations for a dynamical system.


Thus, it is possible to directly connect to recurrent neural networks (RNNs)
[Goo16, Rum88] and the loss function of RNNs using maximum likelihood
estimation. Under the Markov assumption, the probability of an observed
sequence $\mathcal{Z}$ according to a dynamical system (DS) as depicted in
figure 2.3 can be calculated by marginalizing over $\mathbf{x}^k$,


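$$p(\mathbf{z}^1, \ldots, \mathbf{z}^K) = \prod_k \int p(\mathbf{z}^k, \mathbf{x}^k)\, d\mathbf{x}^k, \quad \text{with}$$    (2.3)

$$p(\mathbf{z}^k, \mathbf{x}^k) = p(\mathbf{x}^k \mid \mathbf{x}^{k-1})\, p(\mathbf{z}^k \mid \mathbf{x}^k).$$

Using the negative log-likelihood, the following loss function can be obtained:

$$\mathcal{L}(\Theta)_{\mathrm{DS}} = -\sum_k \log \int p(\mathbf{x}^k \mid \mathbf{x}^{k-1})\, p(\mathbf{z}^k \mid \mathbf{x}^k)\, d\mathbf{x}^k.$$    (2.4)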

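For deterministic transition dynamics,


$$p_\Theta(\mathbf{x}^k \mid \mathbf{x}^{k-1}) = \delta\big(\mathbf{x}^k - f_\Theta(\mathbf{x}^{k-1}, \mathbf{z}^{k-1})\big),$$    (2.5)

the loss function can be reformulated to


$$\mathcal{L}(\Theta)_{\mathrm{DS}} = -\sum_k \log p\big(\mathbf{z}^k \mid f_\Theta(\mathbf{x}^{k-1}, \mathbf{z}^{k-1})\big)$$    (2.6)

$$\phantom{\mathcal{L}(\Theta)_{\mathrm{DS}}} = -\sum_k \log p\big(\mathbf{z}^k \mid f_\Theta(\mathbf{x}^{k-1})\big).$$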


Next, the loss function is recovered from the perspective of RNNs. According
to equation 2.3, the goal is to capture the probability of the observed
sequence $\mathcal{Z}$. RNNs are extensions of multi-layer feed-forward
networks, where hidden units
$\mathcal{H} = \{\mathbf{h}^k : k \in \mathbb{N}\}$ are used to encode an
internal hidden state space [Goo16]. In extension to multi-layer networks,
the parameters are shared across different parts of a model. Here, the
parameters of the transition function are shared across time steps, resulting
in a neural network where the activations of the hidden layers are fed back
into the network along with the input. Figure 2.4 depicts the unfolded
computational graph of an RNN, where the hidden state sequence is used to
compute the output vector sequence
$\mathcal{O} = \{\mathbf{o}^k : k \in \mathbb{N}\}$.


Figure 2.4: Recurrent neural network seen as an unfolded computational graph. Each node is
associated with a particular time instance.


The unfolded model structure corresponds, as in recursive Bayesian filtering,
to a directed acyclic computational graph. Thus, the recurrent network
processes information by incorporating it into the state $\mathbf{h}$ that is
passed forward through time by the transition function.


The hidden state for one time step is given by

$$\mathbf{h}^{k+1} = f_\Theta(\mathbf{h}^k, \mathbf{z}^{k+1}).$$    (2.7)

For a basic RNN [Elm90] the transition function is given by

$$\mathbf{h}^{k+1} = \phi(\mathbf{W}_{hh}\mathbf{h}^k + \mathbf{W}_{hz}\mathbf{z}^{k+1} + \mathbf{b}_h),$$    (2.8)


where $\mathbf{W}_{hh}$ and $\mathbf{W}_{hz}$ represent the weights,
$\mathbf{b}_h$ the biases of a recurrent layer, and $\phi(\cdot)$ an
activation function. Based on ideas of graph unrolling and parameter sharing,
a wide variety of recurrent neural networks can be designed [Goo16]. For the
moment, we stick with an RNN as depicted in figure 2.4 that generates an
output at each time step and uses a hidden-to-hidden recurrent connection as
described above. The depicted RNN does not specify what form the output and
loss function take. Thus, the output $\mathbf{o}^k$ can be used to
parameterize a predictive distribution
$p(\mathbf{z}^{k+1} \mid \mathbf{o}^k)$ over possible next observations
$\mathbf{z}^{k+1}$. In order to match $\mathbf{z}$, the form of
$p(\mathbf{z}^{k+1} \mid \mathbf{o}^k)$ must be chosen carefully. The problem
of finding a good predictive distribution can be very challenging and is
usually referred to as probability density modeling [Gra13a]. Given a hidden
state, the output is computed as follows

$$\mathbf{o}^k = o(\mathbf{W}_{ho}\mathbf{h}^k + \mathbf{b}_o),$$    (2.9)


where $o(\cdot)$ is the output layer function, $\mathbf{W}_{ho}$ denotes a
weight matrix and $\mathbf{b}_o$ denotes a bias vector. The complete network
defines a function, parameterized by the weight matrices, from observations
$\mathbf{z}^{0:k}$ to the output vector $\mathbf{o}^k$. Equation 2.7 can be
considered as the RNN equivalent of the dynamical model, and equation 2.9 can
be considered as the RNN equivalent of the observation model.
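
The recurrence of equations 2.8 and 2.9 can be written down in a few lines of
plain Python. The sketch below is a minimal forward pass only, assuming tanh
as activation function and a linear output layer; the weight names (`W_hh`,
`W_hz`, `W_ho`), the parameter dictionary, and the toy weights are
illustrative choices, not values from this thesis.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def elman_step(h, z_next, W_hh, W_hz, b_h):
    """h^{k+1} = tanh(W_hh h^k + W_hz z^{k+1} + b_h), cf. equation 2.8."""
    pre = [a + b + c
           for a, b, c in zip(matvec(W_hh, h), matvec(W_hz, z_next), b_h)]
    return [math.tanh(p) for p in pre]


def output_layer(h, W_ho, b_o):
    """o^k = W_ho h^k + b_o, cf. equation 2.9 with a linear output function."""
    return [a + b for a, b in zip(matvec(W_ho, h), b_o)]


def run_rnn(observations, params, hidden_size):
    """Unfold the network over an observation sequence, one output per step."""
    h = [0.0] * hidden_size  # initial hidden state
    outputs = []
    for z in observations:
        h = elman_step(h, z, params["W_hh"], params["W_hz"], params["b_h"])
        outputs.append(output_layer(h, params["W_ho"], params["b_o"]))
    return outputs


# Toy parameters: no recurrence, identity input and output weights
params = {
    "W_hh": [[0.0, 0.0], [0.0, 0.0]],
    "W_hz": [[1.0, 0.0], [0.0, 1.0]],
    "b_h": [0.0, 0.0],
    "W_ho": [[1.0, 0.0], [0.0, 1.0]],
    "b_o": [0.0, 0.0],
}
outputs = run_rnn([(0.5, -0.5), (1.0, 0.0)], params, hidden_size=2)
```

The same parameters are reused at every time step, which is the parameter
sharing across time described above.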
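The probability of an observed sequence $\mathcal{Z}$ as estimated by an RNN
is given by


$$p(\mathbf{z}^1, \ldots, \mathbf{z}^K) = \prod_k p(\mathbf{z}^{k+1} \mid \mathbf{o}^k).$$    (2.10)


The corresponding loss function of an RNN using maximum likelihood estimation
can be defined as:


$$\mathcal{L}(\Theta)_{\mathrm{RNN}} = -\sum_k \log p(\mathbf{z}^{k+1} \mid \mathbf{o}^k), \quad \text{or rather}$$    (2.11)

$$\phantom{\mathcal{L}(\Theta)_{\mathrm{RNN}}} = -\sum_k \log p\big(\mathbf{z}^{k+1} \mid f_\Theta(\mathbf{h}^{k-1}, \mathbf{z}^k)\big).$$    (2.12)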
k 


Due to the deterministic nature of RNNs, the computation of the predictive 
distributions is realized by the feed-forward operations in the unfolded net- 
work. By applying backpropagation through time (BPTT) [Wil95] to the 
computational graph, the partial derivatives of the loss with respect to the 
network weights can efficiently be calculated, and the network can be trained 
using stochastic gradient descent. 
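
As a hedged illustration of this maximum likelihood loss, the following
Python sketch evaluates the negative log-likelihood of one-step-ahead
predictions. It assumes an isotropic Gaussian predictive distribution whose
mean is the network output and whose standard deviation `sigma` is fixed;
this choice, the function names, and the toy sequences are assumptions made
for the example. Gradient computation by BPTT is not shown.

```python
import math


def gaussian_nll(z_next, mean, sigma):
    """-log N(z_next; mean, sigma^2 I) for an isotropic Gaussian."""
    dim = len(z_next)
    squared_error = sum((a - b) ** 2 for a, b in zip(z_next, mean))
    return (0.5 * squared_error / sigma ** 2
            + dim * math.log(sigma)
            + 0.5 * dim * math.log(2.0 * math.pi))


def sequence_loss(observations, outputs, sigma=1.0):
    """Sum over k of -log p(z^{k+1} | o^k), with o^k as predicted mean."""
    return sum(gaussian_nll(z_next, o, sigma)
               for z_next, o in zip(observations[1:], outputs[:-1]))


observations = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
perfect = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]  # o^k equals z^{k+1}
lagging = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # o^k merely repeats z^k
loss_perfect = sequence_loss(observations, perfect)
loss_lagging = sequence_loss(observations, lagging)
```

With a fixed `sigma`, minimizing this loss is equivalent to minimizing the
squared prediction error, so the lagging predictor receives the higher loss.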


When comparing the loss functions of equations 2.12 and 2.6, it becomes
evident that the RNN loss corresponds to maximum likelihood estimation with
deterministic dynamics. In Bayesian filtering, the results of the associated
inference problems are given in the form of a conditional probability density
that represents the dynamical state estimate [Hub15]. The estimation tasks
depend on the relation between the time steps $k$ and $K$. If $k < K$, the
estimation problem is referred to as prediction (inferring the future); if
$k = K$, it is referred to as filtering, update, or correction (inferring the
present); and if $k > K$, it is referred to as smoothing (inferring the
past). Prediction and filtering are typically performed on-line, while
smoothing is an off-line estimation task, as it improves past state estimates
given additional information. For Bayesian filters, the conditional densities
are calculated recursively under the assumption that the dynamical state is a
Markov process. For the tracking pipeline described in section 2.1 and for
many other practical applications, prediction and filtering are performed
alternately, which is commonly referred to as the prediction-update cycle.
Based on the connections drawn above between the Bayesian perspective and the
RNN perspective, both on-line estimation tasks of recursive Bayesian filters
have an RNN counterpart, in which prediction and update are realized by
feed-forward operations in the unfolded network. Applying Bayesian filtering
requires strict assumptions, such as Gaussian transitions, to solve the
inference problem, but those assumptions are commonly violated in a real
environment.

Before the seminal Kalman filter [Kal60] is introduced as a basic dynamical
state estimator, this thesis is positioned with respect to the existing
literature for two selected tasks in which a top-down state estimator is a
crucial component of vision-based tracking. While extracting the observation
from images is specific to computer vision, inference is very general and
leads into the field of machine learning and pattern recognition. In order to
narrow down the large number of existing approaches originating from
different communities, these approaches are categorized with respect to the
applied motion model, the level of contextual information used, and the time
horizon under consideration. The following discussion mainly uses the
Bayesian perspective for positioning the contributions compared to related
work.
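
The prediction-update cycle mentioned above can be sketched with a linear
Kalman filter for one image axis. The sketch assumes a nearly-constant-
velocity model with F = [[1, dt], [0, 1]], G = [dt^2 / 2, dt], H = [1, 0],
scalar process noise variance `q` and observation noise variance `r`; these
choices, the initial values, and the function names are illustrative and not
taken from the filters designed later in this thesis.

```python
def kf_predict(x, P, dt, q):
    """Prediction: x <- F x and P <- F P F^T + G q G^T."""
    pos, vel = x
    (p00, p01), (p10, p11) = P
    x_pred = (pos + dt * vel, vel)
    # F P F^T for F = [[1, dt], [0, 1]]
    a00 = p00 + dt * (p10 + p01) + dt * dt * p11
    a01 = p01 + dt * p11
    a10 = p10 + dt * p11
    a11 = p11
    g0, g1 = 0.5 * dt * dt, dt  # noise gain G
    P_pred = ((a00 + q * g0 * g0, a01 + q * g0 * g1),
              (a10 + q * g1 * g0, a11 + q * g1 * g1))
    return x_pred, P_pred


def kf_update(x, P, z, r):
    """Update with a scalar position observation z, H = [1, 0]."""
    pos, vel = x
    (p00, p01), (p10, p11) = P
    s = p00 + r                  # innovation variance H P H^T + R
    k0, k1 = p00 / s, p10 / s    # Kalman gain K = P H^T / s
    innovation = z - pos
    x_new = (pos + k0 * innovation, vel + k1 * innovation)
    # P - K H P
    P_new = ((p00 - k0 * p00, p01 - k0 * p01),
             (p10 - k1 * p00, p11 - k1 * p01))
    return x_new, P_new


def run_filter(observations, dt=1.0, q=0.01, r=0.25):
    """Alternate prediction and update over an observation sequence."""
    x = (0.0, 0.0)                  # initial estimate (position, velocity)
    P = ((10.0, 0.0), (0.0, 10.0))  # large initial uncertainty
    for z in observations:
        x, P = kf_predict(x, P, dt, q)
        x, P = kf_update(x, P, z, r)
    return x, P


# Noise-free positions of an object moving one unit per time step
x_est, P_est = run_filter([float(k) for k in range(1, 21)])
```

Although only positions are observed, the velocity component of `x_est` is
recovered through the dynamical model, which is the sense in which the filter
estimates quantities that are only indirectly observable.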


2.3 Related Work 


The two selected tasks, in which higher-level processing strongly relies on
the state estimator, are path prediction and intention prediction. Whereas
path prediction is mainly tackled as a pure prediction problem, for intention
prediction, prediction and filtering are mostly done jointly (see for example
[Gav99]).


2.3.1 Path Prediction 


In tasks such as path prediction, the term agent often denotes the dynamic 
objects of interest such as robots, pedestrians, cyclists, cars, or other human- 
driven vehicles. The target agent is the dynamic object for which the motion 
prediction is done and corresponds to our tracked object. The term path is
used here restrictively for a sequence of positions, whereas the term
trajectory can include additional information for describing the movement of
the object. In our case, the term trajectory corresponds to a sequence of
object states (see section 2.1)*. The focus here is on path prediction, but
the prediction of video frames, actions, articulated motion, or human
activities often relies on the same motion prediction methods. As explained,
there is a cross-disciplinary interest in and a fast-growing body of work on
motion prediction. In order to categorize the different prediction methods,
we build on the taxonomy introduced by Rudenko et al. [Rud20]. In accordance
with this taxonomy, motion prediction is categorized with respect to the
modeling approach and the type of contextual cues. In figure 2.5, the
categories of the taxonomy introduced by Rudenko et al. are visualized.


The categorizations in other related surveys differ slightly but are similar
at their core. To name a few, there are surveys from application domains such
as service robots [Kru13, Las17], intelligent vehicles [Ras19, Rid18, Bro16,
Lef14], and computer vision [Hir18, Mur17, Mor08].


Most relevant for this thesis is the survey of Hirakawa et al. [Hir18]. They 
survey path prediction methods for vision-based systems, where all the consi- 
dered methods are realized on top of computer vision tasks, such as pedestrian 
detection. This corresponds precisely to our distinction between appearance 
modeling to generate observations as input data and the top-down state esti- 
mator. Hirakawa et al. categorize motion modeling approaches mainly into 
Bayesian models, energy minimization methods, deep learning methods, and 
inverse reinforcement learning methods. In addition, the approaches are
categorized depending on whether they explicitly use object features or
environmental features extracted from a video. Rasouli and Tsotsos [Ras19]
survey pedestrian behavior in the application domain of intelligent vehicles
and use the terms pedestrian factors and environmental factors to distinguish
which kind of factor an approach is aware of.


With the taxonomy of Rudenko et al., this distinction is addressed by
contextual cues. The categories utilized by Kruse et al. [Kru13], Lasota et al.
[Las17], and Lefèvre et al. [Lef14] are also based on motion modeling and are
correspondingly included in the taxonomy.

[Figure: diagram with the labels "Motion prediction", "Modeling", and
"Contextual"]

Figure 2.5: Overview of the taxonomy of categories according to Rudenko et al. [Rud20]. The
categorization of this thesis within the taxonomy is highlighted in red.

* In robotics, the term path is used for describing a space curve without a notion of time and the
term trajectory is used for a path with a notion of time.


The taxonomy depicted in figure 2.5 makes it possible to distinguish
approaches along the different modeling approaches and along an increasing
level of contextual awareness. Using the motion modeling approach as a
classification criterion, prediction approaches are divided into physics-based
methods, pattern-based methods, and planning-based methods. The second
criterion asks which contextual cues are exploited, leading to a classification
into target agent cues, dynamic environment cues, and static environment cues.


In addition to this taxonomy, we distinguish whether some contextual cues are
used in an additional processing step in order to associate the object's
dynamical state with the physical world. From the perspective of a Bayesian
filter with a dynamical model, it is important whether a reasonable observation
model can be applied. In particular, path prediction is mostly done on ground
level, which implicitly requires additional assumptions about the environment,
additional sensors (LIDAR, stereo camera system), or approaches like
structure-from-motion (SfM) [Sze10] to reconstruct a 3D scene. For example, an
intelligent vehicle is equipped with many additional sensors, which allow the
actual prediction of dynamic objects to be done in an ego-motion-compensated,
vehicle-centered coordinate system. Thus, even when the environmental cues are
not used as input for the motion prediction itself, the overall vision-based
system is aware of its environment and implicitly relies on more contextual
cues. However, there are several scenarios in which this mapping is unknown,
is substantially more expensive to obtain, or is an overall unsolved problem.
An example is general object tracking, where the object is tracked directly in
image space on randomly selected videos. In several domains, this implicit
knowledge of the environment is presumed but not always given. An example from
the domain of visual surveillance is video recordings without extrinsic or
intrinsic calibration of the cameras. Whether or not an approach can rely on a
mapping to the physical world is referred to as explicit or implicit contextual
cues in the remainder of this thesis.

In addition to the categories proposed by Rudenko et al., the relevant time
horizon helps to further differentiate between prediction methods. Motion
prediction methods can be roughly categorized into short-term prediction with
relevant time horizons of 0.5 to 2 seconds and long-term prediction with
relevant time horizons of 5 to 20 seconds. In figure 2.6, a mapping between
preferred motion modeling approaches and the prediction time horizon is
visualized.


[Figure: time axis from short-term prediction (0.5 to 2 seconds ahead) to
long-term prediction (5 to 20 seconds ahead), annotated with physics-based
modeling, pattern-based modeling, and planning-based modeling]

Figure 2.6: Categorization of the relevant time horizon for different motion
prediction approaches.


Depending on an increasing time horizon, a shift in the preferred motion mo- 
deling category is visible. Due to the context of maneuvering objects, it is 
clear that a quick reaction to a change in motion is required, and only short- 
term prediction is considered. Nevertheless, the category of planning-based 
methods is kept for a better overall view on motion prediction. 


Physics-based methods: Physics-based methods define an explicit transition
function, the dynamical model, which is based on Newton's laws of motion and
is used as part of a recursive Bayesian filter. Individual dynamical models
differ according to the type of motion they describe. They differ in whether
they describe maneuvering or non-maneuvering motion, in the complexity of the
object dynamics, and in the noise model.

As already described, prediction is done by inferring from observed cues. In
the prediction taxonomy, these models are subdivided into single-model
approaches and multiple-model approaches that involve several modes of
dynamics. In situations where the object's behavior changes abruptly,
multiple-model approaches are utilized. In order to model the motion of
maneuvering objects, different prototypical motion models are fused. A more
detailed description of different fusion strategies and the technical
background of multiple-model approaches is given in chapter 3. In short,
multiple-model methods include an adaptive set of dynamical models and a
fusion strategy to select individual models [Poo17, Koo16, Koo19, Sch13,
Aga12]. Examples of single-model methods include the approaches of Yamaguchi
et al. [Yam11], Pellegrini et al. [Pel09], Zernetsch et al. [Zer16], and
Elnagar et al. [Eln01].


Physics-based methods are commonly considered for short-term predictions. In
contrast to pattern-based methods, they can readily be applied to unknown
environments without the need for training data. They provide fast and
efficient inference, including explicit handling of prediction uncertainty.
Drawbacks are the limited expressive power and the large amount of engineering
required to design a filter [Bar02]. Due to their generalization ability and
their fast inference, physics-based approaches are still the most popular
approaches for applications with a short prediction time horizon, such as
collision avoidance [Rud20].
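To illustrate how a physics-based method yields a short-term prediction
together with an explicit uncertainty, the following minimal sketch repeats the
prediction step of a Kalman filter with a one-dimensional constant velocity
model. It is not the filter design used in this thesis. The function names, the
initial covariance, and the process noise intensity q are assumptions chosen
for illustration only.

```python
def predict_cv(x, P, dt, q):
    """One prediction step of a Kalman filter with a 1D constant velocity model.

    x  -- state [position, velocity]
    P  -- 2x2 state covariance as nested lists
    dt -- time between frames in seconds
    q  -- process noise intensity (hypothetical tuning parameter)
    """
    pos, vel = x
    # State transition x' = F x with F = [[1, dt], [0, 1]].
    x_pred = [pos + dt * vel, vel]

    # Covariance transition P' = F P F^T + Q, written out for the 2x2 case.
    (p00, p01), (p10, p11) = P
    fp00, fp01 = p00 + dt * p10, p01 + dt * p11
    # Q of the discretized white-noise acceleration model.
    q00, q01, q11 = q * dt**3 / 3.0, q * dt**2 / 2.0, q * dt
    P_pred = [
        [fp00 + dt * fp01 + q00, fp01 + q01],
        [p10 + dt * p11 + q01, p11 + q11],
    ]
    return x_pred, P_pred


def predict_path(x, P, dt, q, steps):
    """Repeat the prediction step; returns (position, position variance) pairs."""
    path = []
    for _ in range(steps):
        x, P = predict_cv(x, P, dt, q)
        path.append((x[0], P[0][0]))
    return path


# Object at position 0 m moving with 1 m/s, predicted 2 s ahead in 0.5 s steps.
path = predict_path([0.0, 1.0], [[0.1, 0.0], [0.0, 0.1]], dt=0.5, q=0.05, steps=4)
```

Without new observations, the predicted position follows the straight-line
motion while the position variance grows with every step, which reflects the
explicit handling of prediction uncertainty mentioned above.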


Pattern-based methods: Instead of using an explicit motion model,
pattern-based methods learn generalized transitions and trajectories from
training data. This is done by using different function approximators such as
HMMs or neural networks. Depending on the type of function approximator, two
main categories are distinguished. Sequential methods typically learn
conditional models under the assumption that the dynamical state is
conditionally dependent on the history of past states. As shown in section 2.2,
this function approximator can be realized with HMMs and RNNs. In most cases,
learning the function approximator is posed as a regression problem. For neural
networks, the corresponding loss function is that of a feed-forward network,
with an appropriate distance function for the path being predicted, such as the
squared loss. Under certain assumptions, such as discrete or Gaussian
transitions with random dynamical states, HMMs and Kalman filters,
respectively, allow learning a function approximator. More recent approaches
use variational inference or particle Markov chain Monte Carlo (MCMC) for
large-scale dynamical systems [Bar12]. In order to predict a sequence of state
transitions, consecutive one-step predictions are concatenated into paths of
arbitrary length.
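The concatenation of one-step predictions can be sketched as follows. As a
stand-in for a learned sequential model such as an RNN, the example fits a
single scalar gain that maps the last displacement to the next one by least
squares. This simplification, the function names, and the toy training path are
assumptions for illustration and do not correspond to any of the cited
approaches.

```python
def fit_displacement_gain(paths):
    """Least-squares fit of a gain a such that d[t+1] is approximately a * d[t]."""
    num = den = 0.0
    for path in paths:
        steps = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(path, path[1:])]
        for (dx0, dy0), (dx1, dy1) in zip(steps, steps[1:]):
            num += dx0 * dx1 + dy0 * dy1
            den += dx0 * dx0 + dy0 * dy0
    # Fall back to a constant velocity assumption if there is no motion at all.
    return num / den if den else 1.0


def rollout(history, gain, horizon):
    """Concatenate consecutive one-step predictions into a path of given length."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    dx, dy = x1 - x0, y1 - y0
    x, y = x1, y1
    path = []
    for _ in range(horizon):
        dx, dy = gain * dx, gain * dy  # learned one-step transition
        x, y = x + dx, y + dy          # feed the prediction back as input
        path.append((x, y))
    return path


# Toy training data: one path with constant steps along the x axis.
gain = fit_displacement_gain([[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]])
future = rollout([(3.0, 0.0), (4.0, 0.0)], gain, horizon=3)
```

Because every prediction is fed back as the next input, errors of the one-step
model accumulate with the prediction horizon.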


Examples of sequential pattern-based methods are the approaches of Vem- 
ula et al. [Vem18], Keller et al. [Kel14], Goldhammer et al. [Gol14], Alahi et al. 
[Ale17, Ala16], Kucner et al. [Kuc17], Zhang et al. [Zha19], and Xue et al. 
[Xue19]. A more elaborate description of the technical background for se- 
quential pattern-based methods will be given in chapter 4. 


Non-sequential methods aim to learn a set of motion patterns or directly 
model the distribution over full trajectories without temporal factorization 
of the dynamics. Commonly, non-sequential approaches are based on clustering
in order to identify sets of long-term motion patterns in the observed
trajectories. Clustering is an unsupervised machine learning technique for
identifying structure in unlabeled data [Bis06]. To generate useful clusters,
clustering approaches have to address issues such as the definition of a
distance or similarity measure, the update methodology, and cluster validation
[Mor08]. Examples of non-sequential approaches that model the distribution of
object trajectories are the approaches of Xiao et al. [Xia15], Luber et al.
[Lub12], and Trautman et al. [Tra10].


In summary, pattern-based methods can deal with comparatively large predic- 
tion horizons and are suited for scenarios with complex unknown dynamics. 
On the downside, this requires training samples from specific scenes that
cannot easily be pooled together. A further issue is the generalization
capability. Pattern-based methods tend to be used in non-safety-critical
applications in a spatially constrained environment [Rud20]. In the scope of
this thesis, some of the standard recursive filter functionality is transferred
to such a pattern-based learning approach and used in particular in a
time-critical scenario.


Planning-based methods: As a unique characteristic, planning-based methods
assume a criterion for optimal motion in the environment. By solving a
sequential decision-making problem, the optimal path of an object is computed.
Most approaches differ in the type of objective function that is used to
minimize the total cost of a sequence of actions or rather motions. Thus,
planning-based methods explicitly reason about the goal of a long-term motion
and compute policies or path hypotheses that enable the object to reach those
goals. In order to estimate an optimal path, these methods rely on Markov
decision processes (MDP) [Mur12], reinforcement learning, rapidly-exploring
random trees (RRT) [Kar11], potential fields, or shortest-path algorithms such
as Dijkstra and A* (see for example Thrun et al. [Thr05]). Using the motion
prediction taxonomy, planning-based approaches can be classified into the two
subcategories of forward planning methods and inverse planning methods. The
distinction depends on the choice of the reward function. Forward planning
methods rely on a pre-defined reward function, whereas inverse planning methods
aim to learn the reward function by applying statistical learning techniques to
the trajectory data.
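As a minimal sketch of forward planning with a pre-defined cost, the following
example computes the cheapest path to a known goal on a small grid with
Dijkstra's algorithm. The grid, the cell costs (which could, for instance,
encode different surface types), and the function name are hypothetical and
only serve to illustrate the principle.

```python
import heapq


def shortest_path(cost, start, goal):
    """Dijkstra on a 4-connected grid.

    cost[r][c] is the cost of entering cell (r, c); None marks an obstacle.
    Returns (path, total_cost) or (None, inf) if the goal is unreachable.
    """
    rows, cols = len(cost), len(cost[0])
    dist = {start: 0.0}
    parent = {}
    queue = [(0.0, start)]
    while queue:
        d, cell = heapq.heappop(queue)
        if cell == goal:
            break
        if d > dist[cell]:
            continue  # outdated queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and cost[nr][nc] is not None:
                nd = d + cost[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    parent[(nr, nc)] = cell
                    heapq.heappush(queue, (nd, (nr, nc)))
    if goal not in dist:
        return None, float("inf")
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1], dist[goal]


# Hypothetical cost map: 1 = cheap surface, 5 = expensive surface, None = obstacle.
GRID = [
    [1, 1, 1],
    [1, None, 5],
    [1, 1, 1],
]
planned_path, total_cost = shortest_path(GRID, start=(0, 0), goal=(2, 2))
```

The predicted path avoids the expensive cell because the goal and the cost map
are known in advance. If the environment changes, the cost map and the plan
have to be recomputed.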


Examples from the category of forward planning methods include the approaches
of Rudenko et al. [Rud17], Vasquez [Vas16], Rösmann [Rös17], and Karasev et al.
[Kar16a]. Examples from the category of inverse planning methods include the
approaches of Kitani et al. [Kit12], Rehder et al. [Reh18], and Ziebart et al.
[Zie09].


In summary, planning-based approaches are considered if it is possible to
define goals for the objects explicitly and a model or map of the environment
is available. If these conditions are met, planning-based approaches tend to
generate better long-term predictions than physics-based techniques and tend to
generalize better to unseen environments than pattern-based methods. However,
in dynamic environments, the reward function has to be re-computed, which is
mostly time-consuming. Thus, these approaches are not well suited for
short-term prediction and fast-changing object dynamics. The assets and
drawbacks of the introduced motion modeling approaches are summarized in
table 2.1.


Contextual Cues: Besides the modeling approach, the amount of exploited
contextual cues helps to further distinguish between individual approaches. The
contextual cues for describing the overall contextual awareness of an approach
are briefly explained here.

In the application domain of intelligent vehicles, Rasouli et al. [Ras19]
presented a very detailed survey of factors of pedestrian behavior such as
demographics and environmental conditions. Their categorization into
pedestrian factors and environmental factors includes major factors and
sub-factors that can be directly mapped to the categories of Rudenko et al.
According to Rudenko et al. and using their terminology, the prediction problem
is categorized along the contextual cues based on three criteria. These
classification criteria are defined as the contextual cues from the object
itself (object or target agent cues), cues from a dynamic environment, and cues
from a static environment.


Table 2.1: Summary of the assets and drawbacks of the different motion modeling approaches.

Physics-based approaches
  Assets:
  + Simple, efficient, and work well under mild conditions, in particular for
    short-term prediction horizons.
  + Explainable, data efficient, and generalize well with respect to unseen
    environments.
  + Dynamic contextual cues can be incorporated into the models, but this
    leads to complex algorithms.
  Drawbacks:
  - No reasoning over the global environment.
  - Capture only pre-defined motion dynamics.
  - Large amount of engineering required.

Pattern-based approaches
  Assets:
  + Learning from actual motion of objects.
  + Reduced modeling required.
  + Ability to capture complex dynamics.
  + Long-term predictions.
  + Capture theoretically all contextual cues present in the training data.
  + Fast inference.
  Drawbacks:
  - Require a large amount of training data.
  - Limited generalization to new environments.
  - Low explainability.

Planning-based approaches
  Assets:
  + Generalization to new environments.
  + Explicit reasoning about executed actions directed towards goals, and map
    awareness.
  + Long-term predictions.
  Drawbacks:
  - Mandatory pre-requirement of goals (e.g. as semantic annotations).
  - Re-computation of the reward function required in dynamic environments.
  - Strong dependency on the discretization of action and state spaces.
  - Re-computation of the reward function is time-consuming.

Although, for a prediction taxonomy, it is sufficient not to differentiate the
motion state cues further as part of the object cues, this differentiation is
crucial in this thesis for keeping the strict separation of unobserved and
observed dynamical states. Instead of combining all contextual cues into one
input vector that represents the observed environment in a general formulation
of the prediction function, the input provided by the tracking system is always
considered separately.


Instead of using

$$Y = f_a(C) + e \tag{2.13}$$

to formalize the prediction problem, the following equation is used (see
equation 1.1):

$$Y = f_a(Z, C) + e,$$

where, for a path prediction problem, $Y$ describes the future locations (or a
distribution over the locations). As before, $Z$ are the observations generated
by the tracking system (appearance model of the visual tracker), $C$ are
additional contextual cues extracted from the observed image sequences, and $e$
describes an additional error term.


Thus, the scope of this thesis is to replace the function $f_a$ of a dynamical
system in combination with a recursive Bayesian filter with a learning-based
solution. Because deep learning-based approaches are utilized, the proposed
solutions do not fall unambiguously into a single class of the taxonomy of
Rudenko et al. The starting point is physics-based multiple-model approaches,
which are still the dominant approaches to capture maneuvers.


With respect to the object dynamics, every object is considered separately.
Thus, these approaches are unaware of other objects. For tracking systems with
Bayesian filters, this aspect is tackled by data-association solutions like
multi-hypothesis tracking. As mentioned earlier, for detection-by-tracking the
contextual cues are often implicitly used to allow the mapping to the physical
system under specific assumptions, but the scene context is not included in the
modeling approach for the prediction. Thus, basic physics-based multiple-model
approaches are also unaware of static environment cues. Context-aware
predictors exist for all modeling approaches. Since learning-based approaches
are best suited to integrate this kind of information, we indicate in the
appropriate situations how to handle additional context cues, but keep the
focus on providing stable solutions for the case in which the tracked object is
considered without additional cues.


2.3.2 Intention Prediction 


Although many approaches can relatively reliably predict the location of ob- 
jects a few seconds ahead, they still struggle to predict when the object will 
stop. Towards this end, intention prediction is selected as an exemplary task to 
evaluate different aspects of the proposed solution with respect to the switch- 
ing dynamics of objects. Intention prediction is an expression mainly used in 
the domain of intelligent vehicles as part of the overall pedestrian behavior
analysis of vision-based active safety systems. The pedestrian intention can be
estimated jointly with path prediction, as proposed by [Gav99], but also as a
pure classification task of a pedestrian action. Due to this close relation
between intention and path prediction, approaches for intention prediction can
be categorized with the same taxonomy as before. Furthermore, we look at the
problem from the Bayesian filter perspective and accordingly retain the
modeling basis relying on the observed trajectory.


However, the estimation of the pedestrians' intention with respect to their 
impending motion can basically be tackled with all of the approaches intro- 
duced in section 2.3.1 and mixtures of them. An essential difference is the 
short time-window for the prediction and the decision to be made due to 
the speed of the vehicle. Since physics-based methods are efficient for short
prediction horizons and generalize well to unseen environments, the number of
approaches in recent literature relying on physics-based approaches is
significantly larger for intention prediction than for path prediction.
Multiple-model approaches help to better deal with motion model uncertainties.
The integration of context awareness into the predictors leads to complex
learning algorithms. For inference, the combination with a Bayesian filter is
kept.


In reviews on intention prediction or pedestrian behavior prediction [Rid18,
Ras19], the prediction-update cycle of recursive filters is used to categorize
all approaches originating from tracking as dynamics-based prediction. Thereby,
the distinction between physics-based and pattern-based approaches is lost, but
the Kalman filter, independent of whether the motion model is learned or
selected from physical models, can be set as the baseline approach. A large
variety of physics-based models describing the motion of dynamic objects in
ground, marine, and airborne object tracking is presented in the work of Li et
al. [Li03]. Popular examples of motion models include the constant velocity
(CV) model, the constant acceleration (CA) model, and the constant turn (CT)
model. Since the publication of the seminal Kalman article [Kal60], which
describes a special case of Bayesian filtering, many extensions have been
proposed. Examples are non-linear extensions, such as the extended Kalman
filter (EKF) and the unscented Kalman filter (UKF), and non-Gaussian
extensions, such as the particle filter (PF) [Bar02, Gri18].
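For reference, the CV and CA models mentioned above are commonly written in
discrete time as linear transitions of the state. The following block shows
this standard textbook form for a single coordinate. The symbols $p$ for the
position, $T$ for the sampling interval, and $F_{\mathrm{CV}}$,
$F_{\mathrm{CA}}$ for the transition matrices are notation assumed here and are
not taken from this thesis.

```latex
\begin{align}
  x^{k+1} &= F\, x^{k} + v^{k}, \\
  F_{\mathrm{CV}} &=
  \begin{pmatrix} 1 & T \\ 0 & 1 \end{pmatrix},
  & x^{k} &= \begin{pmatrix} p^{k} & \dot{p}^{k} \end{pmatrix}^{\top}, \\
  F_{\mathrm{CA}} &=
  \begin{pmatrix} 1 & T & \tfrac{1}{2} T^{2} \\ 0 & 1 & T \\ 0 & 0 & 1 \end{pmatrix},
  & x^{k} &= \begin{pmatrix} p^{k} & \dot{p}^{k} & \ddot{p}^{k} \end{pmatrix}^{\top}.
\end{align}
```

In the CV model, the velocity changes only through the process noise $v^k$,
whereas the CA model additionally carries the acceleration in the state. The CT
model depends on the turn rate and is in general non-linear when the turn rate
is part of the state.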


In addition to the previously mentioned physics-based single-model methods,
the following approaches use, inter alia, a Kalman filter approach for the
prediction of pedestrian positions. Bertozzi et al. [Ber04] (EKF), Meuter et
al. [Meu08] (UKF), and Møgelmose [Mog15] (PF) combine the respective filter
with a CV model. In the work of Binelli et al. [Bin05] and Elnagar et al.
[Eln01], a Kalman filter is combined with a CA model. For tracking other road
users, such as bikes and vehicles, variants of the CT model are often utilized
(see for example [Bar08, Bat09]). Zernetsch et al. [Zer16] incorporated
additional object cues in the form of the resistance forces from inclination
and rolling to extend the cyclist dynamical model. In [Sch13], Schneider and
Gavrila conducted a comparative study on using Kalman filters with different
dynamical models for pedestrian path prediction. An alternative approach
relying on an HMM with discrete hidden states representing intention classes
was introduced in the work of Wakim et al. [Wak04]. They classify the four
pedestrian behaviors of standing, walking, jogging, and running.

By combining such intention class prediction with path prediction, we end up
with multiple-model approaches. For longer time horizons, the intention of the
object motion is dominated by its goals. This again illustrates the smooth
transitions between the modeling methods. For example, Kitani et al. [Kit12],
Tamura et al. [Tam12], and Ziebart et al. [Zie09] propose algorithms to learn a
dynamical model yielding goal-directed behavior of pedestrians using maximum
entropy inverse optimal control. Under the assumption that pedestrians make
near-optimal decisions with stochastic policies, probability distributions over
trajectories are predicted.


The primary approaches are referred to as multiple-model methods or hybrid
dynamical state methods [Hof04], which augment the discrete motion or intention
state with the continuous dynamical state. Following the description of Li and
Jilkov [Li10], multiple-model methods consist of the following elements:
firstly, an adaptive dynamical model set; secondly, methods to deal with
discrete-valued uncertainties, such as a Markov or a semi-Markov assumption;
thirdly, a recursive estimation scheme to deal with the continuous dynamical
states conditioned on the dynamical model; and fourthly, a strategy to obtain
the overall best estimate by fusion or selection of individual filters. The
combination of an HMM and a (linear) dynamical system is called a jump Markov
linear system (JMLS) [Mur12]. Other common expressions include switching
state-space model (SSSM) or switching linear dynamical system (SLDS). For
predicting cyclist intentions, Pool et al. [Poo17] presented a mixture of five
linear dynamical models and included the static environmental cues by excluding
individual motion predictions that do not comply with the road topology.
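The interplay of the Markov assumption on the discrete mode and the fusion of
the individual filters can be sketched as follows. The example shows the mode
probability update and a probability-weighted fusion for scalar estimates, as
they are used in multiple-model filters such as the IMM filter. The mixing of
the filter initial conditions and the model-conditioned filters themselves are
omitted. The function names and all numerical values are assumptions for
illustration.

```python
def update_mode_probabilities(mu, transition, likelihoods):
    """Update the probabilities of the discrete modes (dynamical models).

    mu          -- mode probabilities of the previous time step
    transition  -- transition[i][j] = probability of switching from mode i to j
    likelihoods -- likelihood of the new observation under each mode's filter
    """
    n = len(mu)
    # Markov assumption: propagate the mode probabilities one step ahead.
    predicted = [sum(transition[i][j] * mu[i] for i in range(n)) for j in range(n)]
    # Weight with the observation likelihoods and normalize.
    unnormalized = [likelihoods[j] * predicted[j] for j in range(n)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def fuse(estimates, mu):
    """Fuse scalar model-conditioned estimates by probability-weighted averaging."""
    return sum(m * e for m, e in zip(mu, estimates))


# Two hypothetical modes, e.g. "walking" and "stopping", with sticky transitions.
mu = update_mode_probabilities(
    mu=[0.5, 0.5],
    transition=[[0.9, 0.1], [0.1, 0.9]],
    likelihoods=[0.2, 0.8],
)
fused_velocity = fuse([1.4, 0.0], mu)
```

Selecting the estimate of the most probable mode instead of averaging would
correspond to the selection strategy mentioned in the fourth element.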


Instead of a JMLS, Karasev et al. [Kar16a] rely on a jump-Markov decision 
process [Mur12] to model pedestrian motion. The pedestrian dynamics is de- 
scribed with a soft Markov decision process, and the pedestrian goals are the 
hidden discrete states. Environmental cues are included with engineered re- 
ward function terms for surface types (e.g., sidewalk, crosswalk, road, grass). 


As stated before, the interacting multiple-model (IMM) filter is the most
common inference technique applied to tracking problems with maneuvering
objects [Maz98]. For example, [Lin16] used an IMM filter to track pedestrians
in the application domain of service robots. Madrigal et al. [Mad13] proposed
an IMM filter solution in a surveillance scenario. In [Kae04], Kaempchen et al.
tracked maneuvering vehicles with an IMM filter. In the particular context of
intention prediction of pedestrians from a vehicle perspective, Schneider and
Gavrila [Sch13] proposed an IMM filter with several basic motion models. Köhler
et al. combined an IMM filter for pedestrian tracking with a support vector
machine (SVM) to classify the intention to cross based on motion contour images
[Bob96] in a surveillance scenario.


In order to include contextual cues, several approaches added a dynamic
Bayesian network (DBN) [Kol09] or conditional random fields (CRFs) [Laf01] on
top of an SLDS (see for example [Has15a, Has15b, Koo19, Bon14, Koo14, Sch15]).
Specifically, this means that for inference the IMM filter is applied to
predict future object positions, and the additional hidden state influences the
transition probability between the individual dynamical models. Hashimoto et
al. [Has15b] used a DBN to consider the behavior of other pedestrians. In
[Has15a], they included the information of pedestrians being part of a group.
In accordance with Quintero et al. [Qui15] and Keller et al. [Kel11], Hashimoto
et al. reported that it is harder to recognize the decision of a pedestrian to
stop than the decision to cross a street.


Kooij et al. [Koo14] presented a DBN to model the latent factors of head poses 
extracted by a head pose detector to account for inattentive pedestrians. To- 
gether with spatial cues captured by the distance of the pedestrian to the road 
curbside, the change in pedestrian dynamics is controlled. In [Sch15], Schulz 
and Stiefelhagen proposed an intention recognition system based on latent- 
dynamic CRF to integrate the pedestrian head orientation for controlling the 
motion model switches. For the estimation of vehicle trajectories with an IMM 
filter, Kuhn et al. [Kuh15] presented a DBN to embed the context of possible 
routes by a pre-defined environment geometry. 


There is a recognizable trend of integrating more and more contextual cues
about possible causes of intention changes in order to anticipate changes in
dynamics instead of reacting to them. This, however, does not change the fact
that a quick reaction to a change in dynamics is crucial for the overall
system. Even though contextual cues can, for example, be incorporated with DBNs
or other modeling schemes, the currently predominant machine learning paradigm
is neural networks. In particular, RNNs are the standard approach for modeling
sequential data. As explained in section 2.3.1, there is an increasing number
of RNN-based approaches for path prediction. Our goal is to preserve the
benefits of traditional multiple-model methods for dealing with maneuvering
objects, but to get rid of the tedious tuning of the filters. The previously
listed approaches [Ale17, Ala16, Zha19, Xue19] in the category of sequential
pattern-based models are closely related to this work. The technical
background of the RNN-based variants is explained in chapter 4.


At this point, we limit ourselves to several approaches applied in an
intention prediction setting and refer to surveys such as [Rud20, Rid18, Hir18,
Ras19] for further reading. The works of Quintero et al. [Qui15] and Keller et
al. [Kel11] rely neither on neural networks nor on a multiple-model approach,
but are categorized as pattern-based methods. In [Kel11], probabilistic
hierarchical trajectory matching is used to match an observed pedestrian track
with a database of tracklets or rather trajectory sections. Extrapolated future
locations from the best fitting sections are then combined with dynamic
features extracted using dense optical flow inside the pedestrian bounding
boxes or, as in [Qui15], extracted from full-body articulated poses. In both
works, these body motion dynamics are learned using Gaussian process dynamical
models (GPDM) with an HMM to switch between the behavior classes of crossing
and stopping. Quintero et al. [Qui19] included the behavior classes starting
and standing. A pure intention classification approach was proposed in the work
of Völz et al. [Völ15] using an SVM to infer pedestrian crossing from tracks
extracted from a LIDAR sensor. Alternatively, they proposed a regression forest
[Völ16] and later presented an RNN-based solution [Völ19]. Goldhammer et al.
[Gol15, Gol14] introduced a neural network-based approach. Trajectories from a
pedestrian head tracking system in static traffic are approximated with a
least-squares polynomial fit for a fixed input window. The resulting polynomial
coefficients are used as input for a multi-layer perceptron (MLP) [Goo16] to
predict future paths. The RNN-based approach of [Sal18] is mentioned as a
representative example because it uses an RNN regression model to predict
pedestrian paths without the maneuver context, but distinguishes in its
analysis between corresponding trajectory classes such as crossing. In the
context of vehicle maneuvers, such as lane changing, Deo and Trivedi [Deo18]
presented an RNN-based model to compute maneuver-dependent vehicle
trajectories. Together with our proposed RNN-based pedestrian path prediction
network [Bec18c], this model serves as the basis for the presented RNN-based
IMM filter surrogate [Bec19b, Bec19a].
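The idea of describing a fixed input window by polynomial coefficients, as
mentioned above for the approach of Goldhammer et al., can be sketched for one
coordinate as follows. The window length, the polynomial degree, and the
function name are assumptions made for this illustration. The cited works may
use a different parameterization, and the subsequent MLP is not shown.

```python
def fit_polynomial(ts, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients c with y(t) approximately c[0] + c[1]*t + ... + c[degree]*t**degree.
    """
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum t^(i+j) and b[i] = sum y * t^i.
    A = [[sum(t ** (i + j) for t in ts) for j in range(n)] for i in range(n)]
    b = [sum(y * t ** i for t, y in zip(ts, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= factor * A[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / A[i][i]
    return coeffs


# Hypothetical input window of five positions sampled from 1 + 2 t + 0.5 t^2.
window_t = [0.0, 1.0, 2.0, 3.0, 4.0]
window_x = [1.0 + 2.0 * t + 0.5 * t * t for t in window_t]
features = fit_polynomial(window_t, window_x, degree=2)
```

The coefficient vector has a fixed length that is independent of the number of
samples in the window, which makes it a convenient input for a feed-forward
network.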


2.4 Summary


In this chapter, the contributions of this thesis were positioned with respect
to related literature for the selected applications of path and intention
prediction. The focus of the thesis is state estimation of maneuvering objects
as part of a visual tracking pipeline realized as a detection-by-tracking
approach. At a high level of abstraction, the processing pipeline contains the
following elements. Object observations are provided by an appearance model
based on extracted image features, describing the object in image space. These
observations serve as input for a Bayesian filter or the proposed RNN-based
alternatives. We shift from physics-based modeling to pattern-based modeling of
the dynamics. Both modeling approaches predict a parametric distribution over
the object states and jointly capture maneuver probabilities for subsequent
processing stages.

In this chapter, Bayesian filtering solutions including the IMM filter for
dealing with maneuvering objects are explored. After an introduction of the
technical background, design modifications compared to a basic IMM filter are
introduced. Some of the results presented in this chapter have been published
in our previous work [Bec16, Bec18a].


3.1 Background 


As described in section 2.2, dynamical state estimation, also known as Bayesian
filtering, is a general probabilistic approach for recursively estimating an
unknown probability density function over time using incoming observations
and a dynamical model. In order to calculate these densities in a recursive
fashion, the assumption of the dynamical state $x^k$ being a Markov process is
implied. The prediction-update cycles consist of alternating estimates of the
conditional probability density from an initial state density $p(x^0)$ at time
step $k = 0$.


In the prediction step, the predictive distribution of the dynamical state is
computed. Then, the observation model $h^k(x^k, w^k)$ allows the expected
observation to be predicted. Thus, given the conditional density
$p^+(x^k) \triangleq p(x^k | z^{0:k})$, the density of the predicted state at
time step $k + 1$ is calculated according to the Chapman-Kolmogorov equation
[Hub15]

$$ p^-(x^{k+1}) \triangleq p(x^{k+1} | z^{0:k}) = \int p(x^{k+1} | x^k)\, p^+(x^k)\, \mathrm{d}x^k. \tag{3.1} $$

Here, $z^{0:k}$ denotes the observations collected up to time step $k$, where an
actual observation $\tilde{z}^k$ is a realization of $z^k$. $p(x^{k+1} | x^k)$ is
the transition density that depends on the dynamical model
$x^{k+1} = f^k(x^k, v^k)$ with $v^k$ the process noise. That way, a prior estimate
of the current state $\hat{x}^{k,-}$ is obtained. In the update step, a newly
available observation $\tilde{z}^k$ is incorporated into the predictive density
$p^-(x^k)$. The posterior density of the dynamical state can be derived with
Bayes' rule and is given by

$$ p^+(x^k) \triangleq p(x^k | z^{0:k}) = \eta^k\, p(\tilde{z}^k | x^k)\, p^-(x^k), \tag{3.2} $$

with

$$ \eta^k \triangleq \frac{1}{\int p(\tilde{z}^k | x^k)\, p^-(x^k)\, \mathrm{d}x^k}. $$

The term $\eta^k$ represents the normalization constant, the term
$p(\tilde{z}^k | x^k)$ is called the likelihood function of $x^k$ for a given
observation $\tilde{z}^k$ and depends on the observation model
$z^k = h^k(x^k, w^k)$ and the observation noise $w^k$. The posterior $p^+(x^k)$
is the probability distribution over $x^k$ at time step $k$, conditioned on all
past observations $z^{0:k}$, and is also referred to as belief or state of
knowledge [Thr05].


Bayesian filtering provides an optimal solution for equations 3.1 and 3.2 and
can be considered a statistical inversion problem [Hub15]. In figure 3.1, the
prediction-update cycle of a Bayesian filter is visualized. In general, a
closed-form solution of the filtering equations is not possible due to the
integrals and multiplications of density functions involved. Thus, simplifying
assumptions are required.
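
For a finite state space, the integral in equation 3.1 reduces to a sum, and the
prediction-update cycle can be evaluated exactly. The following minimal sketch
illustrates this special case. The three-cell state space, the transition matrix,
and the likelihood values are hypothetical and chosen only for illustration; they
do not stem from the experiments of this thesis.

```python
import numpy as np

def predict(belief, transition):
    """Discrete counterpart of equation 3.1 (Chapman-Kolmogorov).

    transition[a, b] = p(x^{k+1} = a | x^k = b), so every column sums to one.
    """
    return transition @ belief

def update(prior, likelihood):
    """Discrete counterpart of equation 3.2 (Bayes' rule)."""
    unnormalized = likelihood * prior
    eta = 1.0 / unnormalized.sum()  # normalization constant
    return eta * unnormalized

# Hypothetical example: the object tends to move one cell to the right.
transition = np.array([[0.2, 0.0, 0.8],
                       [0.8, 0.2, 0.0],
                       [0.0, 0.8, 0.2]])
belief = np.array([1.0, 0.0, 0.0])      # initial state density p(x^0)
likelihood = np.array([0.1, 0.8, 0.1])  # p(z^k | x^k) for the actual observation

prior = predict(belief, transition)
posterior = update(prior, likelihood)
```

For continuous states, the same two steps require the simplifying assumptions
discussed in the text, since the sums become integrals over density functions.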


Under the assumption of linear dynamical and observation models affected
by Gaussian noise, the seminal Kalman filter (KF) [Kal60] is optimal and has
a closed-form solution. For the case of $w^k$ and $v^k$ being uncorrelated, the
KF is an optimal dynamical state estimator in the sense of least squares
and Bayesian filtering [Gri18]. Thus, the (linear) Kalman filter is a method for
exact on-line inference in a linear DS, and linear DSs are commonly used to
describe basic dynamical models, such as a CV model.
[Figure 3.1 blocks: initial conditions, observations, prediction, update]

Figure 3.1: Visualization of the prediction-update cycle of a Bayesian filter. The filter
recursively estimates the unknown system state $x^k$ from the observations $z^k$ and the
estimated state $\hat{x}^{k-1}$ using the dynamical model and the observation model.


3.1.1 Kalman Filter 


For the Kalman filter, the dynamical model from equation 2.1 and the
observation model from equation 2.2 are restricted to linear equations.
Accordingly, the dynamical model can be described by

$$ x^{k+1} = F^k x^k + G^k v^k \tag{3.3} $$

and the observation model by

$$ z^k = H^k x^k + w^k. \tag{3.4} $$


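Here, $F^k \in \mathbb{R}^{n_x \times n_x}$ is the system matrix of the Kalman filter
and $H^k \in \mathbb{R}^{n_z \times n_x}$ the observation matrix. The noise processes
$v^k \in \mathbb{R}^{n_v}$ and $w^k \in \mathbb{R}^{n_z}$ are assumed to be white
Gaussian noise processes with known covariance matrices $Q^k$ and $R^k$. Further,
it is assumed that $v^k$ and $w^k$ are uncorrelated. $G^k \in \mathbb{R}^{n_x \times n_v}$
is the noise gain, through which the system noise enters the dynamical system:

$$ P_{vv} \triangleq \mathrm{Cov}(v^i, v^k) = E[v^i v^{k\top}] = \begin{cases} Q^k & \text{for } i = k \\ 0 & \text{for } i \neq k \end{cases} \tag{3.5} $$

$$ P_{ww} \triangleq \mathrm{Cov}(w^i, w^k) = E[w^i w^{k\top}] = \begin{cases} R^k & \text{for } i = k \\ 0 & \text{for } i \neq k \end{cases} \tag{3.6} $$

Hence,

$$ v^k \sim \mathcal{N}(0, Q^k), \tag{3.7} $$
$$ w^k \sim \mathcal{N}(0, R^k), \tag{3.8} $$

with

$$ P_{vw} \triangleq \mathrm{Cov}(v^i, w^k) = E[v^i w^{k\top}] = 0 \quad \forall i, k, \tag{3.9} $$
$$ P_{x^0 v} \triangleq \mathrm{Cov}(x^0, v^k) = E[x^0 v^{k\top}] = 0 \quad \forall k, \tag{3.10} $$
$$ P_{x^0 w} \triangleq \mathrm{Cov}(x^0, w^k) = E[x^0 w^{k\top}] = 0 \quad \forall k. \tag{3.11} $$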


Given that a Gaussian distribution can be represented by its first two moments,
the state estimate boils down to calculating the mean vector $\hat{x}^k$ and the
covariance matrix $P_{xx}^k$ of the true state $x^k$. The dynamical state
representation is given by

$$ x^k \sim \mathcal{N}(\hat{x}^k, P_{xx}^k), \tag{3.12} $$

with

$$ P_{xx}^k \triangleq \mathrm{Cov}(x^k, x^k) = E[(x^k - E[x^k])(x^k - E[x^k])^\top] \tag{3.13} $$
$$ \phantom{P_{xx}^k} = E[(x^k - \hat{x}^k)(x^k - \hat{x}^k)^\top]. \tag{3.14} $$


The prior estimates, which are obtained during the prediction step and do not
account for the current observation, are denoted by $\hat{x}^{k,-}$ and
$P_{xx}^{k,-}$:

$$ \hat{x}^{k,-} = E[x^k | z^{0:k-1}], \tag{3.15} $$
$$ P_{xx}^{k,-} = E[(x^k - \hat{x}^{k,-})(x^k - \hat{x}^{k,-})^\top]. \tag{3.16} $$

In the update step, the current observation is incorporated to obtain the
posterior estimate:

$$ \hat{x}^{k,+} = E[x^k | z^{0:k}], \tag{3.17} $$
$$ P_{xx}^{k,+} = E[(x^k - \hat{x}^{k,+})(x^k - \hat{x}^{k,+})^\top]. \tag{3.18} $$


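At time $k = 0$, the Kalman filter is initialized with the prior distribution for
the state $\mathcal{N}(\hat{x}^0, P_{xx}^0)$. The prediction step of a KF can be
derived by using the transition function from equation 3.3 in the expectation
computation:

$$
\begin{aligned}
\hat{x}^{k,-} &= E[x^k | z^{0:k-1}] \\
&= E[F^{k-1} x^{k-1} + G^{k-1} v^{k-1} | z^{0:k-1}] \\
&= E[F^{k-1} x^{k-1} | z^{0:k-1}] + E[G^{k-1} v^{k-1} | z^{0:k-1}] \\
&= F^{k-1} E[x^{k-1} | z^{0:k-1}] + 0 \\
&= F^{k-1} \hat{x}^{k-1,+}.
\end{aligned}
\tag{3.19}
$$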


I 


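The associated covariance matrix results from equations 3.19 and 3.3. With the
abbreviations $F = F^{k-1}$, $G = G^{k-1}$, and $e = x^{k-1} - \hat{x}^{k-1,+}$, it
follows:

$$
\begin{aligned}
P_{xx}^{k,-} &= E[(x^k - \hat{x}^{k,-})(x^k - \hat{x}^{k,-})^\top] \\
&= E[(F x^{k-1} + G v^{k-1} - F \hat{x}^{k-1,+})(F x^{k-1} + G v^{k-1} - F \hat{x}^{k-1,+})^\top] \\
&= E[(F e + G v^{k-1})(e^\top F^\top + v^{k-1\top} G^\top)] \\
&= E[F e e^\top F^\top] + E[F e\, v^{k-1\top} G^\top] + E[G v^{k-1} e^\top F^\top] + E[G v^{k-1} v^{k-1\top} G^\top] \\
&= F\, E[e e^\top]\, F^\top + 0 + 0 + G\, E[v^{k-1} v^{k-1\top}]\, G^\top \\
&= F^{k-1} P_{xx}^{k-1,+} F^{k-1\top} + G^{k-1} Q^{k-1} G^{k-1\top}.
\end{aligned}
\tag{3.20}
$$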
In equation 3.20, $Q^{k-1}$ reflects the uncertainty in the dynamical model. The
uncertainty $P_{xx}^k$ of the dynamical state increases for every one-step
prediction in accordance with this equation.
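
The prediction step of equations 3.19 and 3.20 can be sketched in a few lines.
The function name `kf_predict` and the numerical values (a one-dimensional
constant velocity model with unit sampling time) are assumptions made for
illustration only.

```python
import numpy as np

def kf_predict(x_post, P_post, F, G, Q):
    """Kalman filter prediction (equations 3.19 and 3.20)."""
    x_prior = F @ x_post                      # propagated mean
    P_prior = F @ P_post @ F.T + G @ Q @ G.T  # propagated covariance
    return x_prior, P_prior

# Hypothetical constant velocity model, state = [position, velocity].
dt = 1.0
F = np.array([[1.0, dt],
              [0.0, 1.0]])
G = np.array([[0.5 * dt**2],
              [dt]])
Q = np.array([[0.1]])          # process noise covariance

x_post = np.array([0.0, 1.0])  # posterior mean of the previous step
P_post = np.eye(2)             # posterior covariance of the previous step

x_prior, P_prior = kf_predict(x_post, P_post, F, G, Q)
```

In this example the trace of the covariance grows from 2.0 to 3.125, which
reflects the increase of uncertainty with every one-step prediction.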


In order to calculate the corresponding posterior estimates for the update step
on the basis of the current observation, the yet unknown quantities
$\hat{z}^k$, $P_{zz}^k$, $P_{xz}^k$, and $P_{zx}^k$ must first be determined. The
expected value of the observation can be obtained under consideration of the
observation model:

$$
\begin{aligned}
\hat{z}^k &= E[z^k] \\
&= E[H^k x^k + w^k] \\
&= H^k E[x^k] + E[w^k] \\
&= H^k \hat{x}^{k,-}.
\end{aligned}
\tag{3.21}
$$


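The difference between the actually obtained observation $\tilde{z}^k$ and the
predicted observation $\hat{z}^k$ is called innovation or residuum,

$$ s^k \triangleq \tilde{z}^k - \hat{z}^k = \tilde{z}^k - H^k \hat{x}^{k,-}, \tag{3.22} $$

where the innovation or residual covariance $P_{zz}^k \triangleq S^k$ is given by:

$$ P_{zz}^k = E[(z^k - \hat{z}^k)(z^k - \hat{z}^k)^\top]. \tag{3.23} $$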


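By inserting the observation model into equation 3.23, we obtain

$$
\begin{aligned}
P_{zz}^k &= E[(H^k x^k + w^k - H^k \hat{x}^{k,-})(H^k x^k + w^k - H^k \hat{x}^{k,-})^\top] \\
&= E[H^k (x^k - \hat{x}^{k,-})(x^k - \hat{x}^{k,-})^\top H^{k\top}] + E[H^k (x^k - \hat{x}^{k,-}) w^{k\top}] \\
&\quad + E[w^k (x^k - \hat{x}^{k,-})^\top H^{k\top}] + E[w^k w^{k\top}] \\
&= H^k E[(x^k - \hat{x}^{k,-})(x^k - \hat{x}^{k,-})^\top] H^{k\top} + E[w^k w^{k\top}], \\
S^k &= H^k P_{xx}^{k,-} H^{k\top} + R^k.
\end{aligned}
\tag{3.24}
$$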
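The matrix $P_{xz}^k$ can be determined analogously:

$$
\begin{aligned}
P_{xz}^k &= E[(x^k - \hat{x}^{k,-})(z^k - \hat{z}^k)^\top] \\
&= E[(x^k - \hat{x}^{k,-})(H^k x^k + w^k - H^k \hat{x}^{k,-})^\top] \\
&= E[(x^k - \hat{x}^{k,-})(x^k - \hat{x}^{k,-})^\top H^{k\top}] + E[(x^k - \hat{x}^{k,-}) w^{k\top}] \\
&= E[(x^k - \hat{x}^{k,-})(x^k - \hat{x}^{k,-})^\top] H^{k\top} \\
&= P_{xx}^{k,-} H^{k\top}.
\end{aligned}
\tag{3.25}
$$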


For calculating the observation update, the posterior density (see equation
3.2) is conditioned on the current observation. For a given observation
$\tilde{z}^k$, the resulting observation update of the Kalman filter calculates
the Gaussian posterior density $\mathcal{N}(x^k; \hat{x}^{k,+}, P_{xx}^{k,+})$ of the
dynamical state $x^k$ with mean vector and covariance matrix according to

$$ \hat{x}^{k,+} = \hat{x}^{k,-} + P_{xz}^k (P_{zz}^k)^{-1} (\tilde{z}^k - \hat{z}^k), \tag{3.26} $$
$$ P_{xx}^{k,+} = P_{xx}^{k,-} - P_{xz}^k (P_{zz}^k)^{-1} P_{zx}^k. \tag{3.27} $$


The expressions 3.26 and 3.27 can be derived by conditioning the joint Gaussian
distribution of $x$ and $z$ on $z$ (see for example [Hub15]). Substituting
equations 3.21, 3.24, and 3.25 together with

$$ P_{zx}^k = P_{xz}^{k\top} = H^k P_{xx}^{k,-} \tag{3.28} $$

into equation 3.26, we obtain the desired expression for calculating the
posterior mean

$$ \hat{x}^{k,+} = \hat{x}^{k,-} + P_{xx}^{k,-} H^{k\top} (H^k P_{xx}^{k,-} H^{k\top} + R^k)^{-1} (\tilde{z}^k - H^k \hat{x}^{k,-}). \tag{3.29} $$

It is common to use the abbreviation

$$ K^k = P_{xx}^{k,-} H^{k\top} (H^k P_{xx}^{k,-} H^{k\top} + R^k)^{-1}. \tag{3.30} $$
The matrix $K^k$ is the so-called Kalman gain. Using $K^k$, the update step is
given by:

$$ \hat{x}^{k,+} = \hat{x}^{k,-} + K^k (\tilde{z}^k - H^k \hat{x}^{k,-}), \tag{3.31} $$
$$ P_{xx}^{k,+} = P_{xx}^{k,-} - K^k H^k P_{xx}^{k,-} = (I - K^k H^k) P_{xx}^{k,-}. \tag{3.32} $$
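
As an illustration of the update step, the following sketch evaluates the
innovation, the innovation covariance, the Kalman gain, and the posterior
estimate. The function name `kf_update` and all numerical values are
hypothetical; only the position of a two-dimensional state is observed.

```python
import numpy as np

def kf_update(x_prior, P_prior, z, H, R):
    """Kalman filter update (equations 3.22, 3.24, 3.30, 3.31, 3.32)."""
    s = z - H @ x_prior                   # innovation
    S = H @ P_prior @ H.T + R             # innovation covariance
    K = P_prior @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_post = x_prior + K @ s
    P_post = (np.eye(len(x_prior)) - K @ H) @ P_prior
    return x_post, P_post, s, S

# Hypothetical prior, state = [position, velocity], position is observed.
x_prior = np.array([1.0, 1.0])
P_prior = np.array([[2.0, 1.0],
                    [1.0, 1.0]])
H = np.array([[1.0, 0.0]])
R = np.array([[2.0]])
z = np.array([3.0])  # actual observation

x_post, P_post, s, S = kf_update(x_prior, P_prior, z, H, R)
```

The unobserved velocity is corrected as well, because the prior covariance
couples position and velocity.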


The KF provides equations for propagating Gaussian distributions through 
a linear system, resulting in a maintained Gaussian distribution. In case of 
non-linearities in at least the dynamical model or the observation model, this 
does not apply anymore. Further, the property that the joint density of x 
and z is also Gaussian is only satisfied for linear models. Most non-linear 
filtering approaches utilize the construction of a joint Gaussian of x and z 
and conditioning on z to derive the Kalman filter. Thus, these approaches 
aim to find an approximation for non-linear filtering problems. Some popular 
example filters are the EKF and the iterative EKF (IEKF), which approximate 
the non-linear function by using the Taylor series expansion around the mean 
of the Gaussian distribution. Other approaches, such as the UKF, approximate 
the distribution by means of a set of points that can be propagated through the
non-linear functions and serve to determine the new distribution parameters. 
A generalization of this approach leads to the family of PFs. 


These modification concepts of the KF can be transferred to the concepts
presented in the following. Although only linear models are considered, the
techniques explained in the sequel can also be applied to non-linear models by
linearization. A more elaborate description of non-linear filtering extensions
of the KF can be found in [Bar02, Sär13, Thr05].


3.1.2 Maneuvering Objects


In the absence of the problem of data association, maneuvering object tracking
faces two interrelated main challenges: object motion mode uncertainty and
non-linearity. Mode refers to the true object motion or a pattern of behavior,
and the dynamical model is a mathematical, usually simplified, description of
the object motion with a certain accuracy level. The modes describe the truth
precisely, whereas estimation is based on models, which are approximations of
the modes. Multiple-model approaches are generally considered to be the
mainstream approach to maneuvering object tracking under motion mode
uncertainty [Li10].


In filtering, multiple-model approaches are included in the group of adaptive 
filtering (see for example [Bar02]). In general, a well functioning filter de- 
pends on an adequate choice of Q* and R*. The concept of an adaptive filter 
considers every filter which adapts itself when it detects dynamics that the dy- 
namical model cannot account for. In object tracking, maneuvers are defined 
as model mismatch problems and in addition to multiple-model approaches, 
so-called maneuver detection based methods are also considered as adaptive 
filters. Some example methods are adjustable level process noise [Zar09], vari- 
able state dimension, and input estimation. But these methods are generally 
considered to be too slow to compensate maneuvers [Bar02]. 


As described in section 2.3.2, multiple-model approaches are the preferred 
choice when the object motion is poorly described by a single model. These 
approaches assume that the system behaves in accordance with one of a finite 
number of dynamical models. The models can differ in noise levels or in their 
structure. Such systems are also referred to as hybrid systems or hybrid
dynamical state methods since they augment the continuous dynamical state with
the discrete motion state, i.e., the pair $(x^k, m^k)$. Multiple-model approaches can
be further sub-divided into static and switching multiple-model approaches. 
Static approaches converge quickly to only the most probable model without 
recovering [Lab14]. Thus, a reinitialization of mismatched filters is required. 
This is accomplished by using the estimate from other models. Since the nec- 
essary modifications are a rigorously built-in part of switching multiple-model 
approaches, static approaches are not further considered. 


From the broad set of proposed switching multiple-model approaches, there is 
no clear-cut best algorithm, but the interacting multiple-model (IMM) fil- 
ter is considered to be the best compromise of low computational complexity 
and good tracking performance [Pit05]. In particular for the task of inten- 
tion prediction the IMM filter is the primary approach [Sch13, Koo19, Bon14, 
Sch15, Has15a, Has15b]. 
The prediction problem for maneuvering objects with such multiple-model
systems can be described as

$$ x^{k+1} = f_{m^k}(x^k) + v^k, $$

where $m^k \in \mathcal{M} = \{m_1, \dots, m_M\}$ denotes the mode or dynamical model
at time $k$ that is in effect during the sampling period ending at $k$. The
dynamical model of the linear Kalman filter from equation 3.3 thus yields

$$ x^{k+1} = F^k(m^k)\, x^k + G^k(m^k)\, v^k(m^k), \tag{3.33} $$

and the adapted observation model is given by

$$ z^k = H^k(m^k)\, x^k + w^k(m^k). \tag{3.34} $$

In case equations 3.33 and 3.34 correspond to linear dynamical systems, such
systems are referred to as JMLS [Mur12]. Among others, the expressions SSSM
and SLDS are also common. For the dynamical model and the observation
model, the transition matrices are formulated depending on $m^k$. The mode at
time $k$ is assumed to be among the set of $M$ possible models

$$ m^k \in \{m_j\}_{j=1}^{M}. \tag{3.35} $$


The sequence of dynamical models through time $k$ ($q$th mode history) is
denoted as

$$ M^{k,q} = \{m_{i_{1,q}}, \dots, m_{i_{k,q}}\}, \quad q = 1, \dots, M^k, \tag{3.36} $$

where $i_{\kappa,q}$ is the model index at time $\kappa$ from history $q$ and
$1 \le i_{\kappa,q} \le M$. Note that the number of histories increases
exponentially with time. The switching between the motion models is assumed to
be a Markov process with known transition probabilities. Thus, the mode
switching is described by a Markov-chain with a transition probability matrix
(TPM) whose entries are

$$ p_{ij} = P(m^k = m_i \,|\, m^{k-1} = m_j). \tag{3.37} $$
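
A small numerical sketch may help to make the two statements above concrete: the
propagation of mode probabilities with the TPM and the exponential growth of the
number of mode histories. The two-model TPM and the initial probabilities are
hypothetical. The array `tpm[i, j]` is assumed to store $p_{ij}$ as defined in
equation 3.37, so every column sums to one.

```python
import numpy as np
from itertools import product

# Hypothetical TPM for M = 2 models: tpm[i, j] = P(m^k = m_i | m^{k-1} = m_j).
tpm = np.array([[0.95, 0.10],
                [0.05, 0.90]])
alpha_prev = np.array([0.5, 0.5])  # mode probabilities at time k-1

# One step of the Markov-chain: sum over the previous mode j.
alpha_pred = tpm @ alpha_prev

# All mode histories through k = 3 time steps; their number is M^k.
num_models = tpm.shape[0]
histories = list(product(range(num_models), repeat=3))
num_histories = len(histories)
```

With only two models, the number of histories already doubles with every time
step, which motivates the approximations discussed in the remainder of this
section.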
The event that model $j$ is in effect at time $k$ is denoted as

$$ m_j^k \triangleq \{m^k = m_j\} \tag{3.38} $$

and the conditional probability of the $q$th sequence of models as

$$ \alpha^{k,q} \triangleq P(M^{k,q} \,|\, z^{0:k}). \tag{3.39} $$

The $q$th sequence of models through time $k$ can be written as

$$ M^{k,q} = \{M^{k-1,l}, m_j^k\}, \tag{3.40} $$

where $l$ denotes the parent sequence and $m_j^k$ is the last element of
sequence $q$. Due to the Markov property,

$$ P(m_j^k \,|\, M^{k-1,l}) = P(m_j^k \,|\, m_i^{k-1}) \triangleq p_{ji}, \tag{3.41} $$

where the index $i$ corresponds to the last model in the parent sequence $l$. An
optimal estimator for such a system calculates the following expectation:

$$ E[x^k \,|\, z^{0:k}] = \sum_{q=1}^{M^k} E[x^k \,|\, M^{k,q}, z^{0:k}]\, P(M^{k,q} \,|\, z^{0:k}). \tag{3.42} $$

Thus, the conditional pdf of the dynamical state $x^k$ at time $k$, obtained with
the theorem of total probability with respect to the mutually exclusive and
exhaustive set of events (equation 3.36), is a Gaussian mixture with an
exponentially increasing number of terms [Bar02]:

$$ p(x^k \,|\, z^{0:k}) = \sum_{q=1}^{M^k} p(x^k \,|\, M^{k,q}, z^{0:k})\, P(M^{k,q} \,|\, z^{0:k}). \tag{3.43} $$
q-1 
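By using Bayes' rule, the probability of a sequence of models can be obtained
as

$$
\begin{aligned}
\alpha^{k,q} &= P(M^{k,q} \,|\, z^{0:k}) \\
&= P(M^{k,q} \,|\, z^k, z^{0:k-1}) \\
&= \eta\, p(z^k \,|\, M^{k,q}, z^{0:k-1})\, P(M^{k,q} \,|\, z^{0:k-1}) \\
&= \eta\, p(z^k \,|\, M^{k,q}, z^{0:k-1})\, P(m_j^k, M^{k-1,l} \,|\, z^{0:k-1}) \\
&= \eta\, p(z^k \,|\, M^{k,q}, z^{0:k-1})\, P(m_j^k \,|\, M^{k-1,l}, z^{0:k-1})\, P(M^{k-1,l} \,|\, z^{0:k-1}) \\
&= \eta\, p(z^k \,|\, M^{k,q}, z^{0:k-1})\, P(m_j^k \,|\, M^{k-1,l}, z^{0:k-1})\, \alpha^{k-1,l},
\end{aligned}
\tag{3.44}
$$

where $\eta$ is the normalization constant. Since the current mode only depends
on the previous one, it follows:

$$ \alpha^{k,q} = \eta\, p(z^k \,|\, M^{k,q}, z^{0:k-1})\, P(m_j^k \,|\, m_i^{k-1})\, \alpha^{k-1,l} \tag{3.45} $$
$$ \phantom{\alpha^{k,q}} = \eta\, p(z^k \,|\, M^{k,q}, z^{0:k-1})\, p_{ji}\, \alpha^{k-1,l}, \tag{3.46} $$

where $i = i_{k-1,l}$ is the index of the last model $m^{k-1}$ of the parent
sequence $l$.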


Equation 3.46 shows that even if the model sequence is Markov, a condition- 
ing on the entire past history is required. In order to prevent a combinatorial 
explosion and to apply multiple-models in practice, approximations of the op- 
timal solution are required. The different multiple-model approaches vary in 
the way they approximate equation 3.43. As explained in section 2.3, they
consist of the following elements [Li10]. Firstly, the adaptive set of selected 
dynamical models. Secondly, methods to deal with discrete value uncertain- 
ties, such as a Markov or a semi-Markov assumption. Thirdly, a recursive 
estimation scheme to deal with the continuous dynamical states conditioned 
on the dynamical model. Fourthly, a strategy to estimate the overall best filter 
by fusion or selection of individual filters. 


In [Pit05], Pitre et al. compared several multiple-model methods including the
generalized pseudo-Bayesian filter of first order (GPB1) and of second order
(GPB2) [Cha78], the IMM filter, the B-best based multiple-model filter [Tug82],
and the Viterbi-based multiple-model algorithm [Ave91] for tracking
applications and showed that the IMM filter offers the best compromise between
good tracking performance and low computational complexity. The GPB2 filter and
the IMM filter approximate equation 3.43 identically, but the GPB2 filter
requires $M^2$ combined Kalman filters instead of the $M$ combined Kalman filters
of the IMM algorithm with similar tracking performance. A more detailed
derivation of the GPB filters can for example be found in [Bar02, Pit05]. Here,
we focus on the IMM solution for the filtering problem.


The basic idea of IMM is to approximate equation 3.42 or rather equation 3.43
by

$$ E[x^k \,|\, z^{0:k}] = \sum_{j=1}^{M} E[x^k \,|\, m_j^k, z^{0:k}]\, P(m_j^k \,|\, z^{0:k}), \tag{3.47} $$

$$ p(x^k \,|\, z^{0:k}) = \sum_{j=1}^{M} p(x^k \,|\, m_j^k, z^{0:k})\, P(m_j^k \,|\, z^{0:k}). \tag{3.48} $$

Here, $M$ basic filters run in parallel, and every filter is optimal for one
specific state of the discrete Markov-chain, i.e., the dynamical model and the
observation model fit to a specific mode with regard to equations 3.33 and
3.34. For the term $P(m_j^k \,|\, z^{0:k})$, the posterior mode probability, the
abbreviation $\alpha_j^k$ is used. The IMM algorithm consists of three major
steps: interaction (mixing), filtering, and combination. The following
derivation of these steps follows [Wen11, Bar02].


IMM-Interaction 

The first step of the IMM filter cycle is interaction. Here, the Markov-chain
is propagated through time. Accordingly, the estimate of the motion mode
and the estimate of the dynamical state must be adapted. Thus, the following
transitions must be determined:

$$ \alpha_i^{k-1} \rightarrow \alpha_i^{k|k-1} \quad \forall i, \tag{3.49} $$
$$ p(x^{k-1} \,|\, m_i^{k-1}, z^{0:k-1}) \rightarrow p(x^{k-1} \,|\, m_i^{k}, z^{0:k-1}) \quad \forall i. \tag{3.50} $$

The term $\alpha_i^{k|k-1} \triangleq P(m_i^k \,|\, z^{0:k-1})$ is the probability that
model $i$ is in effect in time step $k$, given that all observations up to time
step $k-1$ are available. By applying the Chapman-Kolmogorov equation (see
equation 3.1) to the Markov-chain, the desired transition for equation 3.49 is
given by

$$ \alpha_i^{k|k-1} = \sum_{j=1}^{M} P(m_i^k \,|\, m_j^{k-1}, z^{0:k-1})\, \alpha_j^{k-1} = \sum_{j=1}^{M} p_{ij}\, \alpha_j^{k-1}. \tag{3.51} $$

Here, $P(m_i^k \,|\, m_j^{k-1}, z^{0:k-1}) = p_{ij}$ is the probability that model $i$
is in effect in time step $k$, given that all observations up to time step
$k-1$ are available and that model $j$ was in effect in time step $k-1$.
When applying Bayes' rule, it follows

$$ P(m_j^{k-1} \,|\, m_i^k, z^{0:k-1})\, \alpha_i^{k|k-1} = P(m_i^k \,|\, m_j^{k-1}, z^{0:k-1})\, \alpha_j^{k-1} = p_{ij}\, \alpha_j^{k-1}, \tag{3.52} $$

and thus

$$ \alpha_{j|i}^{k-1} \triangleq P(m_j^{k-1} \,|\, m_i^k, z^{0:k-1}) = \frac{p_{ij}\, \alpha_j^{k-1}}{\alpha_i^{k|k-1}} = \frac{p_{ij}\, \alpha_j^{k-1}}{\sum_{l=1}^{M} p_{il}\, \alpha_l^{k-1}} = \frac{p_{ij}\, \alpha_j^{k-1}}{\bar{c}_i}, \tag{3.53} $$

with the normalization factor $\bar{c}_i = \sum_{l=1}^{M} p_{il}\, \alpha_l^{k-1}$.

The desired pdf from equation 3.50 can now be described by

$$ p(x^{k-1} \,|\, m_i^k, z^{0:k-1}) = \sum_{j=1}^{M} p(x^{k-1} \,|\, m_i^k, m_j^{k-1}, z^{0:k-1})\, \alpha_{j|i}^{k-1}. \tag{3.54} $$

The weighting probabilities $\alpha_{j|i}^{k-1} = P(m_j^{k-1} \,|\, m_i^k, z^{0:k-1})$ are
referred to as mixing or interacting probabilities. Using Bayes' rule, we get

$$
p(x^{k-1} \,|\, m_i^k, m_j^{k-1}, z^{0:k-1})\, P(m_i^k \,|\, m_j^{k-1}, z^{0:k-1})
= P(m_i^k \,|\, x^{k-1}, m_j^{k-1}, z^{0:k-1})\, p(x^{k-1} \,|\, m_j^{k-1}, z^{0:k-1}),
\tag{3.55}
$$

and thus

$$
p(x^{k-1} \,|\, m_i^k, m_j^{k-1}, z^{0:k-1})
= \frac{P(m_i^k \,|\, x^{k-1}, m_j^{k-1}, z^{0:k-1})}{P(m_i^k \,|\, m_j^{k-1}, z^{0:k-1})}\, p(x^{k-1} \,|\, m_j^{k-1}, z^{0:k-1}).
\tag{3.56}
$$

The probability of $m^k = m_i$ depends on $m^{k-1} = m_j$, but not on $x^{k-1}$. Thus,
the numerator and the denominator of the fraction in equation 3.56 are equal. By
using this simplified version of equation 3.56, we can also simplify 3.54, the
desired pdf from equation 3.50, according to

$$ p(x^{k-1} \,|\, m_i^k, z^{0:k-1}) = \sum_{j=1}^{M} p(x^{k-1} \,|\, m_j^{k-1}, z^{0:k-1})\, \alpha_{j|i}^{k-1}. \tag{3.57} $$

According to 3.57, the pdf of the dynamical state conditioned on model $m_i$
is a weighted sum of $M$ Gaussian distributions (mixture-of-Gaussian). The
step following interaction is the prediction by using an elementary dynamical
model. Thus, the sum of Gaussians from 3.57 has to be approximated
by a single Gaussian distribution. The approximation by a single Gaussian
can be done by moment matching (see for example [Bar02]). Thus, the mixed
initial condition (mixed mean and covariance) for each filter is computed as

$$ \hat{x}_{0i}^{k-1,+} = \sum_{j=1}^{M} \alpha_{j|i}^{k-1}\, \hat{x}_j^{k-1,+}, \tag{3.58} $$

$$ P_{0i}^{k-1,+} = \sum_{j=1}^{M} \alpha_{j|i}^{k-1} \left( P_j^{k-1,+} + (\hat{x}_j^{k-1,+} - \hat{x}_{0i}^{k-1,+})(\hat{x}_j^{k-1,+} - \hat{x}_{0i}^{k-1,+})^\top \right). \tag{3.59} $$

Here, $\hat{x}_j^{k-1,+}$ and $P_j^{k-1,+}$ are the updated mean and covariance for
model $j$ at time step $k-1$.
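
The interaction step (equations 3.51, 3.53, 3.58, and 3.59) can be sketched as
follows. The function name `imm_mix`, the array layout, and the two-model
example with a scalar state are assumptions made for illustration; `tpm[i, j]`
is assumed to store $p_{ij}$, the probability of switching from model $j$ to
model $i$.

```python
import numpy as np

def imm_mix(tpm, alpha_prev, x_post, P_post):
    """IMM interaction: predicted mode probabilities and mixed initial conditions.

    tpm[i, j]  : p_ij, transition from model j (time k-1) to model i (time k)
    alpha_prev : (M,) posterior mode probabilities at time k-1
    x_post     : (M, n) posterior means of the M filters at time k-1
    P_post     : (M, n, n) posterior covariances of the M filters at time k-1
    """
    c_bar = tpm @ alpha_prev                          # equation 3.51
    mix = tpm * alpha_prev[None, :] / c_bar[:, None]  # mix[i, j] = alpha_{j|i}, eq. 3.53
    x_mixed = mix @ x_post                            # equation 3.58
    num_models, dim = x_post.shape
    P_mixed = np.zeros((num_models, dim, dim))
    for i in range(num_models):
        for j in range(num_models):
            d = x_post[j] - x_mixed[i]
            P_mixed[i] += mix[i, j] * (P_post[j] + np.outer(d, d))  # equation 3.59
    return c_bar, mix, x_mixed, P_mixed

# Hypothetical example with M = 2 models and a scalar state.
tpm = np.array([[0.9, 0.2],
                [0.1, 0.8]])
alpha_prev = np.array([0.5, 0.5])
x_post = np.array([[0.0], [2.0]])
P_post = np.array([[[1.0]], [[1.0]]])

c_bar, mix, x_mixed, P_mixed = imm_mix(tpm, alpha_prev, x_post, P_post)
```

Each row of `mix` sums to one, and the mixed covariance contains a
spread-of-means term in addition to the weighted filter covariances.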


IMM-Filtering 

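In the filtering step, after initialization with $\hat{x}_{0i}^{k-1,+}$ and
$P_{0i}^{k-1,+}$, the Kalman filter equations (3.19, 3.20 and 3.31, 3.32) are
applied for each individual filter. Correspondingly, in the prediction step of
the KF, the pdfs of the models are propagated through time

$$ p(x^{k-1} \,|\, m_i^k, z^{0:k-1}) \rightarrow p(x^{k} \,|\, m_i^k, z^{0:k-1}) \quad \forall i. \tag{3.60} $$

Thus, the prediction part is given by

$$ [\hat{x}_i^{k,-}, P_i^{k,-}] = \mathrm{KF}_p[\hat{x}_{0i}^{k-1,+}, P_{0i}^{k-1,+}, F_i^{k-1}, Q_i^{k-1}]. \tag{3.61} $$

Here, the abbreviation $F_i^k$ corresponds to $F^k(m_i^k)$. This also applies
accordingly to $Q_i^k$, $H_i^k$, and $R_i^k$. The model probabilities are not
affected here. In the update step of the IMM filter, the pdfs of the filters are
updated

$$ p(x^{k} \,|\, m_i^k, z^{0:k-1}) \rightarrow p(x^{k} \,|\, m_i^k, z^{0:k}) \quad \forall i, \tag{3.62} $$

and thus

$$ [\hat{x}_i^{k,+}, P_i^{k,+}] = \mathrm{KF}_u[\hat{x}_i^{k,-}, P_i^{k,-}, \tilde{z}^k, H_i^k, R_i^k]. \tag{3.63} $$

In addition to the parameters of the pdf, the model probabilities have to be
adapted

$$ \alpha_i^{k|k-1} \rightarrow \alpha_i^{k} \quad \forall i. \tag{3.64} $$

Using Bayes' rule yields

$$
P(m_i^k \,|\, \tilde{z}^k, z^{0:k-1}, \hat{x}_i^{k,-})
= \frac{p(\tilde{z}^k \,|\, m_i^k, z^{0:k-1}, \hat{x}_i^{k,-})\, P(m_i^k \,|\, z^{0:k-1}, \hat{x}_i^{k,-})}{p(\tilde{z}^k \,|\, z^{0:k-1}, \hat{x}_i^{k,-})}.
\tag{3.65}
$$

The likelihood $\Lambda_i^k$ of the observation for each filter is computed as

$$
\Lambda_i^k = p(\tilde{z}^k \,|\, \hat{x}_i^{k,-})
= \frac{1}{\sqrt{(2\pi)^{n_z}\, |S_i^k|}} \exp\left(-\frac{1}{2}\, s_i^{k\top} (S_i^k)^{-1} s_i^k\right),
\tag{3.66}
$$

where $s_i^k$ is the observation innovation and $S_i^k$ the innovation covariance
(see equations 3.22 and 3.23) of the KF update step of model $m_i$.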
Due to the fact that all observations up to time step $k-1$ are incorporated in
$\hat{x}_i^{k,-}$, it is possible to write

$$ p(\tilde{z}^k \,|\, m_i^k, z^{0:k-1}, \hat{x}_i^{k,-}) = p(\tilde{z}^k \,|\, \hat{x}_i^{k,-}) = \Lambda_i^k, \tag{3.67} $$

and we get for equation 3.65

$$ \alpha_i^k = \frac{\Lambda_i^k\, \alpha_i^{k|k-1}}{\sum_{j=1}^{M} \Lambda_j^k\, \alpha_j^{k|k-1}}. \tag{3.68} $$

Note that $\alpha_i^{k|k-1}$ is the propagated model probability. Thus, we can use
the expression $\bar{c}_i$ from equation 3.53 to rewrite the model probability
update equation 3.68 according to

$$ \alpha_i^k = \frac{1}{c}\, \Lambda_i^k\, \bar{c}_i. \tag{3.69} $$

Here, $c = \sum_{i=1}^{M} \Lambda_i^k\, \bar{c}_i$ is the normalization factor for
equation 3.69. This expression is commonly used in literature ([Bar02, Lab14,
Sär13]).
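
The model probability update of equations 3.66 and 3.69 is sketched below. The
function names and the numerical values are hypothetical. The first filter
explains the observation perfectly (zero innovation), while the second filter
has an innovation of two standard deviations.

```python
import numpy as np

def gaussian_likelihood(s, S):
    """Likelihood of the innovation s with covariance S (equation 3.66)."""
    n_z = s.shape[0]
    norm = np.sqrt((2.0 * np.pi) ** n_z * np.linalg.det(S))
    return float(np.exp(-0.5 * s @ np.linalg.solve(S, s)) / norm)

def update_mode_probabilities(likelihoods, c_bar):
    """Posterior mode probabilities (equation 3.69)."""
    unnormalized = likelihoods * c_bar
    return unnormalized / unnormalized.sum()  # division by c

# Hypothetical innovations of M = 2 filters with a scalar observation.
innovations = [np.array([0.0]), np.array([2.0])]
covariances = [np.array([[1.0]]), np.array([[1.0]])]
c_bar = np.array([0.5, 0.5])  # predicted mode probabilities

likelihoods = np.array([gaussian_likelihood(s, S)
                        for s, S in zip(innovations, covariances)])
alpha = update_mode_probabilities(likelihoods, c_bar)
```

The better-fitting model receives the larger posterior probability, which in
turn shifts the mixing weights of the next interaction step towards this model.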


IMM-Combination 

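The final step in an IMM cycle is combination. The combination of the
model-conditioned estimates and covariances is done by moment matching
of the Gaussian mixture as follows:

$$ \hat{x}^{k,+} = \sum_{j=1}^{M} \alpha_j^k\, \hat{x}_j^{k,+}, \tag{3.70} $$

$$ P^{k,+} = \sum_{j=1}^{M} \alpha_j^k \left( P_j^{k,+} + (\hat{x}_j^{k,+} - \hat{x}^{k,+})(\hat{x}_j^{k,+} - \hat{x}^{k,+})^\top \right). \tag{3.71} $$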
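
A minimal sketch of the combination step is given below. The function name
`imm_combine` and the example values for two filters with a two-dimensional
state are assumptions made for illustration.

```python
import numpy as np

def imm_combine(alpha, x_post, P_post):
    """Combined mean and covariance of the IMM filter (equations 3.70 and 3.71)."""
    x = alpha @ x_post                   # weighted mean of the filter estimates
    P = np.zeros_like(P_post[0])
    for a, x_j, P_j in zip(alpha, x_post, P_post):
        d = x_j - x
        P += a * (P_j + np.outer(d, d))  # filter covariance plus spread of means
    return x, P

# Hypothetical posterior results of M = 2 filters.
alpha = np.array([0.75, 0.25])
x_post = np.array([[0.0, 0.0],
                   [4.0, 0.0]])
P_post = np.stack([np.eye(2), np.eye(2)])

x_comb, P_comb = imm_combine(alpha, x_post, P_post)
```

The combined covariance is larger along the first axis, because the two filter
estimates disagree in this component.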
j=l 


This completes a full IMM filter cycle. For a better overview the most impor- 
tant equations of the IMM algorithm are summarized in the following. 
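* Interaction

  Mixing probabilities:

  $$ \alpha_{j|i}^{k-1} = \frac{p_{ij}\, \alpha_j^{k-1}}{\sum_{l=1}^{M} p_{il}\, \alpha_l^{k-1}} $$

  Mixed mean and mixed covariance:

  $$ \hat{x}_{0i}^{k-1,+} = \sum_{j=1}^{M} \alpha_{j|i}^{k-1}\, \hat{x}_j^{k-1,+} $$

  $$ P_{0i}^{k-1,+} = \sum_{j=1}^{M} \alpha_{j|i}^{k-1} \left( P_j^{k-1,+} + (\hat{x}_j^{k-1,+} - \hat{x}_{0i}^{k-1,+})(\hat{x}_j^{k-1,+} - \hat{x}_{0i}^{k-1,+})^\top \right) $$

* Filtering

  Prediction:

  $$ [\hat{x}_i^{k,-}, P_i^{k,-}] = \mathrm{KF}_p[\hat{x}_{0i}^{k-1,+}, P_{0i}^{k-1,+}, F_i^{k-1}, Q_i^{k-1}] $$

  Update:

  $$ [\hat{x}_i^{k,+}, P_i^{k,+}] = \mathrm{KF}_u[\hat{x}_i^{k,-}, P_i^{k,-}, \tilde{z}^k, H_i^k, R_i^k] $$

  Model probability update:

  $$ \alpha_i^k = \frac{1}{c}\, \Lambda_i^k\, \bar{c}_i, \quad \text{with} \quad c = \sum_{i=1}^{M} \Lambda_i^k\, \bar{c}_i, \quad \bar{c}_i = \sum_{j=1}^{M} p_{ij}\, \alpha_j^{k-1} $$

* Combination

  Combined mean and combined covariance:

  $$ \hat{x}^{k,+} = \sum_{j=1}^{M} \alpha_j^k\, \hat{x}_j^{k,+} $$

  $$ P^{k,+} = \sum_{j=1}^{M} \alpha_j^k \left( P_j^{k,+} + (\hat{x}_j^{k,+} - \hat{x}^{k,+})(\hat{x}_j^{k,+} - \hat{x}^{k,+})^\top \right) $$
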

3.2 IMM Filter for Visual Tracking 


In this section, we present our adapted designs of the basic IMM filter, namely
a state de-coupling and a state re-coupling scheme. All filters are
applied as top-down state estimators in a visual tracking pipeline.


3.2.1 De-coupled IMM Filter 


The dynamical models of recursive Bayesian filters rely on explicitly defined
dynamic equations that follow physical models such as Newton's laws of motion.
In order to apply a physics-based dynamical model, not only a good
physical model is required, but also a mapping from the observations to the
3D physical world. As described in section 2.3, such a mapping to the physical
world can be ensured by utilizing contextual cues to better interpret the
observed scene or by including more assumptions about the environment. For
example, in order to perform path prediction on ground level, additional
sensors (LIDAR, stereo camera system) or approaches like structure-from-motion
(SfM) are used to reconstruct the 3D scene.


Although the aim of several approaches is to determine such a mapping function,
there exist several scenarios where this mapping is unknown, involves
substantially higher expense, or is an unsolved problem. Accordingly, without
contextual cues to estimate a mapping function, object tracking is
performed directly in the 2D image space. Thus, the objects are tracked solely
on directly mapped observations from the appearance model. A typical example
is the tracking of objects without available external calibration. Figure 3.2
shows two examples where the state estimate solely relies on the enclosing
bounding box of the object. In particular, the observation models are mapped
linearly to the dynamical state of the object. As a consequence, the dynamical
models are used as general-purpose models to capture the object motion. In
other words, the dynamical models are only rudimentary models of the true
object motion mode with relatively large process noise levels.


Observation model: $H^k = [\, I_{3 \times 3} \;\; 0_{3 \times 6} \,]$

Figure 3.2: Two tracked objects where the dynamical state is directly estimated from
observations provided by the output of a person detector or a visual tracker in terms of
the center position (x,y) and the object scale (s). (Top) An example image from
the Daimler Mono Pedestrian dataset [Enz09]. (Bottom) An example image from the
sequence car of the VOT2014 dataset [Ceh16, Kri14].
As discussed earlier, an IMM filter is a good choice for dealing with motion
uncertainties and reducing the effect of model mismatching. The effect of model
mismatching is clearly present in a detection-by-tracking pipeline for tracking
in image space. Although an IMM filter can describe more complex object
dynamics by combining several basic dynamical models, some pitfalls arise
when an IMM filter is restricted to this situation. In the following, these
pitfalls are analyzed by comparing a standard IMM filter setup to a proposed
de-coupling of the dynamical states for the case of tracking objects only with
directly mapped object observations. At first, a reference IMM setup for the
desired scenario is introduced. For combining several dynamical models, the
dynamical state of the object is described according to

$$ x^{k+1} = F_i^k x^k + v_i^k, \tag{3.72} $$

and the observation model is given by

$$ z^k = H_i^k x^k + w_i^k. \tag{3.73} $$

The observation noise $w^k \in \mathbb{R}^{n_z}$ is assumed to be uncorrelated with
the process noise and modeled as a white Gaussian noise process
$w^k \sim \mathcal{N}(0, R^k)$. For the goal of tracking an object on directly mapped
observations, $H_i^k$ includes only binary values. As observations provided by an
appearance model, the unified bounding box is used. Thus, $z^k$ includes the
center position (x,y) of the object in image $I^k$ and the object scale s (see
figure 3.2). Such information can be obtained from every object detector
following the sliding window paradigm. Although common detectors differ in many
aspects, the output of such a sliding window-based detector is a rectangular
bounding box centered at the object location [Dol12, Enz09]. Alternatively,
almost every visual tracker compared in the study of Cehovin et al. [Ceh16]
uses the enclosing bounding box to represent the object state in the image. In
order to choose an adaptive model set, the three most common general-purpose
dynamical models are considered. These dynamical models are the constant
position (CP), the constant velocity (CV), and the constant acceleration
(CA) model. Besides being applicable as translational models for tracking in
image space, these models are used for modeling the motion behavior of
pedestrians for intention prediction with a single dynamical model [Mog15,
Eln01, Bin05], or combined as a multiple-model approach [Sch15, Gol14, Sch13].
Accordingly, the transition matrices $F_i^k$ can then be defined as


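$$ F_{CP}^k = \begin{bmatrix} I_{3\times3} & 0_{3\times6} \\ 0_{6\times3} & 0_{6\times6} \end{bmatrix} \tag{3.74} $$

for the CP model, and as

$$ F_{CV}^k = \begin{bmatrix} I_{3\times3} & I_{3\times3}\Delta T & 0_{3\times3} \\ 0_{3\times3} & I_{3\times3} & 0_{3\times3} \\ 0_{3\times3} & 0_{3\times3} & 0_{3\times3} \end{bmatrix} \tag{3.75} $$

for the CV model. In literature, several assumptions on how to model the
acceleration process of an object are proposed. Here, in accordance with Li et
al. [Li05], the following CA model has been chosen:

$$ F_{CA}^k = \begin{bmatrix} I_{3\times3} & I_{3\times3}\Delta T & \frac{1}{2} I_{3\times3}\Delta T^2 \\ 0_{3\times3} & I_{3\times3} & I_{3\times3}\Delta T \\ 0_{3\times3} & 0_{3\times3} & I_{3\times3} \end{bmatrix}. \tag{3.76} $$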


Hence, $F_i^k \in \{F_{CP}^k, F_{CV}^k, F_{CA}^k\}$ with
$m_i^k \in \{m_{CP}^k, m_{CV}^k, m_{CA}^k\}$. In case these models are applied in image
space, every single model is used together with a relatively high model
uncertainty. With the above choice of the model set structure, the dynamical
state $x^k$ of the reference IMM filter is given by

$$ x^k = [x, y, s, \dot{x}, \dot{y}, \dot{s}, \ddot{x}, \ddot{y}, \ddot{s}]^\top. \tag{3.77} $$

In addition to the directly observed center position (x,y) and the object scale
(s), the IMM filter uses the corresponding velocities $(\dot{x}, \dot{y}, \dot{s})$
and accelerations $(\ddot{x}, \ddot{y}, \ddot{s})$. For standard Kalman filtering,
a de-coupling of the states is redundant. Due to the characteristics of an IMM
filter, both choosing a wrong single motion model and carelessly extending the
states can lead to a non-optimal performance. The IMM filter avoids the
combinatorial explosion of optimal multiple-model filtering by conditioning all
filters on the currently active model, and the final state estimate is obtained
by merging the results of all base filters (see section 3.1.2). Thus, a poor
estimate of the active model affects the weighting of the mixed inputs in the
interaction step. Thereby, a combination of the location and the scale in a
single state vector can result in errors in the calculation of the model
probability. For example, the scale of an object can remain constant while the
object is moving. Thus, the best fitting model for the scale is CP, although
this model is a poor fit for the image position. Therefore, we propose to
de-couple the state estimate. In practice, this is done by using an additional
IMM filter. Hence, the scale and the corresponding velocity and acceleration
are estimated independently from the position states and their derivatives.
This first state separation step leads to the following IMM configuration:

$$ x^k = \left[ [x, y, \dot{x}, \dot{y}, \ddot{x}, \ddot{y}]^\top, [s, \dot{s}, \ddot{s}]^\top \right] = [x_1^k, x_2^k]. \tag{3.78} $$

Thus, the state estimation problem can be written in terms of two de-coupled
state sub-vectors

$$
\begin{bmatrix} x_1^{k+1} \\ x_2^{k+1} \end{bmatrix}
= \begin{bmatrix} F_{1,i}^k & 0_{6\times3} \\ 0_{3\times6} & F_{2,j}^k \end{bmatrix}
\begin{bmatrix} x_1^k \\ x_2^k \end{bmatrix}
+ \begin{bmatrix} G_1^k v_{1,i}^k \\ G_2^k v_{2,j}^k \end{bmatrix}.
\tag{3.79}
$$

A separation of the scale with an additional filter seems obvious, but when
tracking with directly mapped image space data, a split into independent image
coordinates may not immediately appear to be necessary. In order to
show the benefit of such an IMM setup, we propose the following IMM
configuration:

$$ x^k = \left[ [x, \dot{x}, \ddot{x}]^\top, [y, \dot{y}, \ddot{y}]^\top, [s, \dot{s}, \ddot{s}]^\top \right] = [x_1^k, x_2^k, x_3^k]. \tag{3.80} $$

The proposed de-coupled estimator results in

$$
\begin{bmatrix} x_1^{k+1} \\ x_2^{k+1} \\ x_3^{k+1} \end{bmatrix}
= \begin{bmatrix} F_{1,i}^k & 0_{3\times3} & 0_{3\times3} \\ 0_{3\times3} & F_{2,j}^k & 0_{3\times3} \\ 0_{3\times3} & 0_{3\times3} & F_{3,l}^k \end{bmatrix}
\begin{bmatrix} x_1^k \\ x_2^k \\ x_3^k \end{bmatrix}
+ \begin{bmatrix} G_1^k v_{1,i}^k \\ G_2^k v_{2,j}^k \\ G_3^k v_{3,l}^k \end{bmatrix}.
\tag{3.81}
$$


Here, three IMM filters are used to describe the x position, the y position,
the scale s, and their corresponding derivatives. Thus, the motion along each
image axis is captured with a separate filter.
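
The block structure of the de-coupled estimator can be illustrated with a small sketch. It assembles a block-diagonal transition matrix from three per-filter matrices and applies it to a stacked state. The zero-padded 3x3 forms of the CP, CV, and CA models, the sampling interval `dt`, the chosen model per filter, and all function names are assumptions made for this illustration and are not taken from the thesis implementation.

```python
def cp_matrix(dt):
    """Constant position (assumed padded form): keep position, zero the rest."""
    return [[1.0, 0.0, 0.0],
            [0.0, 0.0, 0.0],
            [0.0, 0.0, 0.0]]


def cv_matrix(dt):
    """Constant velocity (assumed padded form): acceleration is zeroed."""
    return [[1.0, dt, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0]]


def ca_matrix(dt):
    """Constant acceleration on the state [position, velocity, acceleration]."""
    return [[1.0, dt, 0.5 * dt * dt],
            [0.0, 1.0, dt],
            [0.0, 0.0, 1.0]]


def block_diag(*blocks):
    """Place square blocks on the diagonal; all other entries are zero."""
    size = sum(len(b) for b in blocks)
    out = [[0.0] * size for _ in range(size)]
    offset = 0
    for b in blocks:
        for r, row in enumerate(b):
            for c, value in enumerate(row):
                out[offset + r][offset + c] = value
        offset += len(b)
    return out


def mat_vec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


# Hypothetical situation: x follows CA, y follows CV, the scale follows CP.
F = block_diag(ca_matrix(1.0), cv_matrix(1.0), cp_matrix(1.0))
state = [10.0, 2.0, 0.5, 50.0, -1.0, 0.0, 120.0, 0.0, 0.0]
predicted = mat_vec(F, state)
print(predicted)
```

Since the off-diagonal blocks are zero, the three filters can equally be run independently; the stacked matrix is only formed here to mirror the notation.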


The strategy of de-coupling state estimates and basing the estimator on
reduced-order filters has hitherto mainly been used in air traffic
surveillance. In this field, an aircraft's motion in the horizontal plane is
usually independent of its vertical motion [Bar02]. For example, in the work of
Yeddanapudi et al. [Yed97] and Wang et al. [Wan99], the aircraft motion in the
North and East directions is estimated with one filter, while a second filter
estimates the vertical state (the altitude and the vertical velocity). Besides
the explained benefits, especially for an IMM filter restricted to directly
mapped image space observations, a de-coupled system also has a lower
computational complexity than a system using only one state vector.


3.2.2 Evaluation: De-coupled IMM Filter 


The IMM configurations described above are evaluated on the VOT2014 dataset
[Ceh16, Kri14]. This dataset is a selection of 25 prototypical object tracking
sequences. Although the dataset was originally designed to compare different
visual trackers, it includes a variety of object motions from different object
categories. Some example sequences from the VOT2014 dataset are depicted in
figure 3.3. Because the object type, the capturing sensor, and the tracking
scenario differ strongly, this dataset includes several situations in which
the estimation of the mapping function to a 3D physical reference system is an
unsolved problem or would require considerable effort.


Figure 3.3: Example tracking sequences from the VOT2014 dataset [Ceh16, Kri14]. Unified
bounding boxes of the objects are shown over time (in frames) for the sequences
bicycle, gymnastics, and surfing.


Settings: The overall performance depends on a number of design parameters.
The most critical ones are the model set structure, the process and observation
noises, the initial state, and the jump structure given by the transition
probabilities. Although the basic IMM setup with three standard motion models
(CP, CV, CA) is sub-optimal for some scenarios of the VOT2014 dataset, this
combination is kept fixed. In practice, the TPM is often assumed to be known
and is chosen a priori. As stated in Bar-Shalom [Bar02], an ad-hoc approach is
to fill the diagonal with values close to 1. We set the diagonal entries to
0.99 and the remaining transition values to 0.005. The IMM filter is relatively
insensitive to small changes in the TPM. Since the CV model is the most widely
used model in tracking approaches, we set the initial model probability in
favor of this model to 0.98 and to 0.01 for the other models. The observation
noise is assumed to be additive white noise. The process noise is modeled as
the acceleration increment during a sampling interval (discrete Wiener process
acceleration). In the experiments, the variances of both noise processes were
varied over the values 1, 2, 5, and 10.
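
The settings above can be written down compactly. The following sketch builds the TPM with 0.99 on the diagonal, propagates the initial model probabilities through it once, and forms a process noise covariance for the discrete Wiener process acceleration model in its common textbook form, Q = variance * g g^T with g = [dt^2/2, dt, 1]^T. The function names, the model order (CP, CV, CA), and the sampling interval `dt = 1` are assumptions for this example.

```python
def transition_matrix(n_models=3, stay=0.99):
    """TPM with `stay` on the diagonal; the rest is spread evenly per row."""
    switch = (1.0 - stay) / (n_models - 1)
    return [[stay if i == j else switch for j in range(n_models)]
            for i in range(n_models)]


def predicted_model_probabilities(tpm, mu):
    """c_j = sum_i p_ij * mu_i, the model probabilities before the update."""
    n = len(mu)
    return [sum(tpm[i][j] * mu[i] for i in range(n)) for j in range(n)]


def wiener_acceleration_noise(dt, variance):
    """Q = variance * g g^T with g = [dt^2 / 2, dt, 1]^T."""
    g = [0.5 * dt * dt, dt, 1.0]
    return [[variance * a * b for b in g] for a in g]


tpm = transition_matrix()
mu0 = [0.01, 0.98, 0.01]          # assumed order: CP, CV, CA
c = predicted_model_probabilities(tpm, mu0)
Q = wiener_acceleration_noise(dt=1.0, variance=2.0)

print([round(v, 5) for v in c])
print(Q)
```

With a diagonal this close to 1, one propagation step barely changes the initial probabilities, which matches the intent of favoring the CV model at start-up.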


Table 3.1: Results for the comparison between different IMM de-coupling configurations on the
VOT2014 dataset. Settings: $\sigma_v^2 = 2$, $\sigma_w^2 = 5$, $t_{update} = 3$.


              failure rate             average IoU
sequence      IMM1   IMM2   IMM3       IMM1   IMM2   IMM3
ball          0.270  0.191  0.164      0.634  0.679  0.695
basketball    0.003  0.003  0.004      0.863  0.884  0.891
bicycle       0.304  0.233  0.173      0.602  0.641  0.692
bolt          0.080  0.044  0.027      0.774  0.810  0.842
car           0.340  0.261  0.108      0.610  0.642  0.710
david         0.167  0.152  0.141      0.697  0.715  0.720
diving        0.082  0.135  0.135      0.793  0.749  0.736
drunk         0.000  0.000  0.000      0.931  0.929  0.931
fernando      0.018  0.021  0.018      0.857  0.852  0.859
fish1         0.242  0.144  0.111      0.663  0.726  0.725
fish2         0.107  0.084  0.064      0.745  0.769  0.775
gymnastics    0.107  0.138  0.138      0.798  0.787  0.786
hand1         0.262  0.232  0.227      0.656  0.664  0.665
hand2         0.410  0.379  0.359      0.542  0.559  0.576
jogging       0.047  0.054  0.047      0.769  0.776  0.777
motocross     0.092  0.188  0.188      0.752  0.745  0.755
polarbear     0.011  0.008  0.011      0.848  0.849  0.852
skating       0.000  0.000  0.000      0.866  0.881  0.898
sphere        0.295  0.300  0.316      0.618  0.629  0.605
sunshade      0.559  0.571  0.484      0.426  0.442  0.488
surfing       0.111  0.081  0.048      0.699  0.746  0.773
torus         0.150  0.146  0.142      0.706  0.723  0.712
trellis       0.358  0.276  0.238      0.579  0.633  0.661
tunnel        0.129  0.078  0.101      0.695  0.734  0.729
woman         0.051  0.053  0.051      0.773  0.794  0.805


Results: For every image sequence, the first 10 frames are excluded from the
evaluation and used for initializing the filters. The update interval
$t_{update}$ for providing a new observation to the filter has been varied
between every single, every third, and every fifth frame. Since the standard
output of object detectors is a rectangular bounding box centered at the
object location, we use the ground truth bounding boxes from the VOT2014
dataset for evaluating the prediction accuracy. Performance measures aim at
summarizing the extent to which the tracker's prediction agrees with the
ground truth annotation. In Čehovin et al. [Ceh14], a general definition of an
object state description in a sequence of length N is established based on the
center of the object and the region of the object at time k. From the IMM
filter, the predicted center location x, y and the scale s are used to
calculate a unified bounding box $\hat{A}^k$. The overlap between the predicted
and the ground truth region can be calculated as


\[
\mathrm{IoU} = \frac{|\hat{A}^k \cap A^k_{GT}|}{|\hat{A}^k \cup A^k_{GT}|} \tag{3.82}
\]


The overlap ratio is often referred to as intersection-over-union (IoU) or
Jaccard index [Jac08]. For the ground truth area $A^k_{GT}$, a unified bounding
box is considered as well. In general, the width of the enclosing bounding box
is more strongly influenced by the body pose of the objects than its height.
Hence, a unified bounding box with a width of 1/3 of the bounding box height is
used. Although the selected ratio is better suited for object categories such
as pedestrians, it also achieves a good assignment between observation and
prediction for object categories with a deviating width-to-height ratio. A
property of the IoU is that it simultaneously accounts for position and size.
Thus, there is no need for additional normalization considerations. The IoU is
summarized over an entire sequence by an average IoU. In addition to the
average IoU, the failure rate is used as a second comparative score. The
failure rate is the fraction of frames in which the IoU is below a threshold of
0.5 and is computed as

\[
\text{failure rate} = \frac{\text{number of frames with } \mathrm{IoU} < 0.5}{\text{number of frames}}. \tag{3.83}
\]

Figure 3.4: Visualization of the failure rate for the sequences bicycle and surfing with an overlap
threshold of 0.5 for $\sigma_v^2 = 2$, $\sigma_w^2 = 5$, $t_{update} = 3$.
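
The overlap and failure-rate measures of equations 3.82 and 3.83 can be sketched in a few lines. The box representation `(x0, y0, x1, y1)`, the function names, and the example values are assumptions for this illustration; only the width of 1/3 of the height and the threshold of 0.5 are taken from the text.

```python
def unified_box(cx, cy, s, aspect=1.0 / 3.0):
    """Axis-aligned box (x0, y0, x1, y1) of height s and width aspect * s."""
    w = aspect * s
    return (cx - w / 2.0, cy - s / 2.0, cx + w / 2.0, cy + s / 2.0)


def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


def average_iou(ious):
    """Mean overlap over all evaluated frames of a sequence."""
    return sum(ious) / len(ious)


def failure_rate(ious, threshold=0.5):
    """Fraction of frames whose overlap is below the threshold."""
    return sum(1 for v in ious if v < threshold) / len(ious)


# Illustrative values only: a prediction shifted vertically by a third of
# the object height against a ground truth box of height 90 pixels.
ground_truth = unified_box(0.0, 0.0, 90.0)
prediction = unified_box(0.0, 30.0, 90.0)
print(iou(prediction, ground_truth))
print(failure_rate([0.9, 0.4, 0.6, 0.2]))
```

Because both boxes are derived from the scale alone, a scale error changes width and height together, so position and size errors are both reflected in a single overlap value.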


The overall results for the three different IMM configurations are summarized
by way of example for $\sigma_v^2 = 2$, $\sigma_w^2 = 5$, $t_{update} = 3$ in
table 3.1. The corresponding visualization of the failure rate for the
sequences bicycle and surfing (see figure 3.3) with an overlap threshold of 0.5
is shown in figure 3.4. The results for other parameter settings differ
slightly but show the same core behavior: the achieved overlap varies and, for
some specific sequences, the ranking of the IMM configurations changes.
Overall, the IMM configuration that uses separate filters for the image
coordinates and the scale outperforms the other configurations. Because the
motion of objects in some sequences is highly non-linear, the chosen
combination of motion models is not optimal. This can also result in a changed
ranking, but the trend towards the third configuration achieving superior
results is visible for all evaluated parameter settings. It should be noted
that two sequences in particular (gymnastics and diving) do not comply with the
results achieved on the other sequences. Both include objects which execute a
rotation. Here, we used unified bounding boxes and associated the height with
the object scale. When the object is rotating, this association is wrong and
leads to highly non-linear motion patterns due to the change in the bounding
box orientation. Under such conditions, the current rotation should be
considered in the state of the object; otherwise it is not possible to
associate the object scale with the bounding box height. Besides, it is
inherently difficult to produce uncontroversial ground truth boxes for rotating
objects. Hence, the annotations for these sequences include a stronger
ambiguity for the enclosing bounding boxes and are erroneous in some cases.


Because the ground truth includes an uncertainty and the overlap values for
ranking the different filter configurations are in some cases very close to
each other, a hypothesis test is applied as follows. The VOT2014 dataset
contains 25 videos, and the evaluation has been structured such that each
sequence provides a single data point that can be used to conduct a pairwise
test between the IMM filter configurations. Given two IMM configurations, a
sequence is categorized in favor of one configuration based on the average
overlap. Equal values are excluded. The counts for these cases follow a
binomial distribution. If the setups are equivalent, the probability of one
configuration being favored over the other is 0.5. Binomial statistics can then
be used to compute a p-value and to decide whether or not to reject the null
hypothesis that both configurations are equivalent.

The setup of the applied exact binomial tests is listed in the following, and
the results are summarized in table 3.2.


* Hypothesis statement: The compared IMM configurations perform equally well.
* Null hypothesis: $H_0: P = 1/2$
* Alternative hypothesis: $H_1: P \neq 1/2$
* Test distribution: $X_{test} \sim \mathrm{Bin}(N_{test}, P)$
* Test statistic: L = number of test sequences for which an IMM configuration
  performs better.
* p-value (two-sided): $p = 2 \sum_{l=L}^{N_{test}} \binom{N_{test}}{l} P^l (1 - P)^{N_{test} - l}$
* Significance level: $\alpha_{test} = 0.05$


Table 3.2: Statistical hypothesis tests for the different IMM configurations on the VOT2014 data-
set based on the results from table 3.1.


test               first better   second better   p-value (two-sided)   result
IMM 3 vs. IMM 2    19             6               0.0146                null hypothesis rejected
IMM 3 vs. IMM 1    21             3               0.0003                null hypothesis rejected
IMM 2 vs. IMM 1    20             5               0.0041                null hypothesis rejected
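
The p-values in table 3.2 can be reproduced with a few lines of standard-library Python. The sketch below doubles the upper tail of the binomial distribution, which is a valid two-sided p-value here because the null hypothesis uses P = 1/2 and the distribution is therefore symmetric. The function name is an assumption; the counts are taken from table 3.2.

```python
from math import comb


def two_sided_binomial_p(wins, losses, p=0.5):
    """Two-sided exact binomial p-value by doubling the larger tail.

    Doubling one tail is exact only for the symmetric case p = 0.5.
    """
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, l) * p ** l * (1.0 - p) ** (n - l)
               for l in range(k, n + 1))
    return min(1.0, 2.0 * tail)


# Counts of sequences won by each configuration (ties excluded), table 3.2.
comparisons = [
    ("IMM 3 vs. IMM 2", 19, 6),
    ("IMM 3 vs. IMM 1", 21, 3),
    ("IMM 2 vs. IMM 1", 20, 5),
]
for name, wins, losses in comparisons:
    p_value = two_sided_binomial_p(wins, losses)
    print(name, round(p_value, 4), "rejected" if p_value < 0.05 else "kept")
```

The comparison between IMM 3 and IMM 1 uses only 24 data points, since the sequence drunk yields identical average overlaps for both configurations and is excluded.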


In all cases, the null hypothesis is rejected, since the p-values are below the
significance level and thus indicate that the differences in performance are
significant. As mentioned, other parameter settings lead to only slight
variations in performance and can therefore lead to a less distinctive result
for a specific setup. Nevertheless, the overall trend shows that the fully
de-coupled IMM filter is an improvement over the other IMM filter
configurations. An improvement is also achieved by de-coupling only location
and scale: the second configuration (IMM 2) outperforms the naive state
extension (IMM 1). This state de-coupling is also recommended when the actual
motion is described in 3D. These results follow the intuition that when
tracking an object directly in image space, the motion in one direction can be
independent of the motion in the other direction.


Although this assumption is not always true, it is commonly used in many filter
designs, including single-model variants. Because the base filters are
conditioned on the best fitting model, the final estimate is negatively
influenced by a naive extension of the state vector. Figure 3.5 illustrates
this effect by visualizing the model probabilities of the de-coupled IMM
filter: the individual model probabilities differ for all de-coupled states.
For combining the scale and its derivatives with the actual motion states, this
seems obvious. However, the presented results show how crucial this can also
become when mixing image coordinates.


In summary, when relying on directly mapped observations, which is common for
tracking in image space, a naive extension of the state vector should be
avoided. The fact that independent states are affected by mixing the inputs
from the base filters, which is a result of the approximation required for
optimal filtering, can easily be overlooked when applying IMM filters for
tracking in image space. With this simple reminder, better IMM filtering can be
achieved. While the overall performance can be further improved by selecting
alternative motion models which better fit the dynamics of the object in the
scene, it is also crucial not to extend the state naively. All states of an
IMM state vector should depend on each other; each additional independent state
and its derivatives should be handled by an additional IMM filter. In this way,
the conditioning on the current best fitting model cannot negatively affect the
overall performance. The motion of an object in image space is a good example
of a case in which the dynamics along the image axes should be considered
independently when applied to an IMM filter.


3.2.3 Re-coupled IMM Filter


Figure 3.6: Example images of detected persons with the approach of Kieritz et al. [Kie16] on 
the Daimler Mono Pedestrian dataset [Enz09]. 


In this section, a state re-coupling scheme for the IMM filter configuration is
introduced. The proposed re-coupled IMM filter provides an online adaptation
scheme for the system noise parameters in order to better capture location
uncertainties pertaining to image coordinates. The observation $z^k$ for
tracking in image space is often obtained from an object detector. As an
example, figure 3.6 shows persons detected with the approach of Kieritz et al.
[Kie16]. The applied person detector follows the widely used sliding window
paradigm: for every window location and scale, a binary classification is
performed. The classifier consists of weighted decision trees that use the
integral over a rectangular region of a feature channel as nodes, and is
generated by boosting [Fre97]. Thus, the output of such detectors, as well as
that of most visual tracking approaches, is a rectangular bounding box. Similar
to section 3.2.2, the performance of person detectors is also measured by the
IoU between the detector output and the ground truth bounding box. A standard
threshold for a detector output to be categorized as a true positive is 0.5
(see for example Dollár et al. [Dol12]).


The IoU criterion bears the risk that the scale dependency of the observation
noise is easily overlooked. For sequences where the range of person or other
object scales is very limited, a fixed observation noise variance is adequate.
Nevertheless, it is intuitively clear that the location accuracy is
scale-dependent. To demonstrate the scale dependency of the observation noise,
we evaluated our person detector (Kieritz et al. [Kie16]) on the MOT16 dataset
[Mil16] and the Daimler Mono Pedestrian dataset [Enz09], which together cover a
broad range of person scales. Only detections with an overlap greater than 0.5
are considered in the analysis. Without applying non-maximum suppression to the
detector output (multiple detections are thereby associated with a single
annotation), a total of 450826 detections were compared to the ground truth
data. By dividing the range of person scales into several intervals, the effect
of the scale dependency on the pixel distance in the x and y directions can be
seen. When the pixel distance is normalized by the object scale, a zero-mean
error distribution with a relatively constant variance over the chosen scale
intervals is observed. This is shown for the error in x in figure 3.7. It can
be clearly seen that the displacement error is larger for larger person scales.
Normalized by the true scale, the standard deviation of the error is nearly
constant.
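
The kind of analysis described above can be sketched as follows: errors are grouped into scale intervals, and the spread of the raw and of the scale-normalized error is computed per interval. The function name, the interval boundaries, and in particular the toy numbers are made up for illustration and are not measurements from the thesis.

```python
from statistics import pstdev


def error_spread_by_scale(errors_px, scales_px, bins):
    """Std. deviation of raw and scale-normalized errors per scale interval."""
    result = {}
    for low, high in bins:
        pairs = [(e, s) for e, s in zip(errors_px, scales_px) if low <= s < high]
        if len(pairs) < 2:
            continue  # not enough samples for a spread estimate
        raw = [e for e, _ in pairs]
        relative = [e / s for e, s in pairs]
        result[(low, high)] = (pstdev(raw), pstdev(relative))
    return result


# Made-up toy values: the pixel error grows with the person height,
# while the error relative to the height keeps the same spread.
scales = [100.0, 100.0, 100.0, 400.0, 400.0, 400.0]
errors = [-5.0, 0.0, 5.0, -20.0, 0.0, 20.0]
spread = error_spread_by_scale(errors, scales, [(0, 200), (300, 1000)])
for interval, (raw_std, rel_std) in spread.items():
    print(interval, round(raw_std, 3), round(rel_std, 4))
```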


Figure 3.7: (Left) Distribution (relative frequency) of the detection error in x in pixels for
persons smaller than 200 pixels (s < 200) and persons greater than 300 pixels
(s > 300). (Right) Distribution (relative frequency) of the detection error in x
relative to the ground truth person height (x/s).


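In order to take the shown dependency between the noise levels and the scale
of the object into account, we propose a re-coupling of the states of the IMM
filter. The overall state-space model of the proposed IMM filter can be
expressed as:


\[
\begin{bmatrix} x_1^{k+1} \\ x_2^{k+1} \\ x_3^{k+1} \end{bmatrix} =
\begin{bmatrix}
F_{1,j}^k & 0_{3\times 3} & 0_{3\times 3} \\
0_{3\times 3} & F_{2,j}^k & 0_{3\times 3} \\
0_{3\times 3} & 0_{3\times 3} & F_{3,j}^k
\end{bmatrix}
\begin{bmatrix} x_1^k \\ x_2^k \\ x_3^k \end{bmatrix} +
\begin{bmatrix} G_{1,j}^k v_{1,j}^k \\ G_{2,j}^k v_{2,j}^k \\ G_{3,j}^k v_{3,j}^k \end{bmatrix} \tag{3.84}
\]

\[
\begin{bmatrix} z_1^k \\ z_2^k \\ z_3^k \end{bmatrix} =
\begin{bmatrix}
H_{1,j}^k & 0_{3\times 3} & D_{1,j}^k \\
0_{3\times 3} & H_{2,j}^k & D_{2,j}^k \\
0_{3\times 3} & 0_{3\times 3} & H_{3,j}^k
\end{bmatrix}
\begin{bmatrix} x_1^k \\ x_2^k \\ x_3^k \end{bmatrix} +
\begin{bmatrix} 0_{3\times 1} \\ 0_{3\times 1} \\ w_{3,j}^k \end{bmatrix} \tag{3.85}
\]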


Here, $D^k$ is a weighting matrix for tuning the scale-dependent noise. In the
overall dynamical model, we realized a coupling of the scale dynamics to the
image coordinate dynamics which is very similar to augmenting a state for
dealing with time-correlated noise (see for example Wendel et al. [Wen04]). It
should be noted that the scale has to be strictly positive, whereas the
Gaussian distribution assumed by the filter allows negative values. Since the
detection of persons requires a reasonable scale, the problem can be neglected
in most cases. By using a relative scale, this problem can be avoided and, even
more importantly, the noise process can be modeled by a zero-mean white
sequence. Without using the relative scale change, the noise of the scale is
state-dependent. For a treatment of this situation, we refer for example to
Spinello et al. [Spi10]. In the case of visual tracking, it is common to
describe the scale change relative to the initial scale. Thus, the scale state
vector $x_3$ can be replaced with the relative scale change $s_r = s / s_0$ and
the corresponding velocity and acceleration $[\dot{s}_r, \ddot{s}_r]$. In
sequences where the range of person scales is very limited, as in the VOT2014
dataset, a fixed setting of the noise variances is sufficient. But in scenarios
with a relatively fast change of the object scale, for example in images
collected from a driving vehicle, the proposed IMM filter helps to avoid an
under- or over-estimation of the system uncertainties.


3.2.4 Evaluation: Re-coupled IMM Filter


The benefit of the complete scheme of re-coupling IMM filter states is
evaluated on 10 selected sequences from the Daimler Mono Pedestrian dataset
[Enz09], which was captured on board a vehicle driving through an urban
environment. Figure 3.8 depicts the trajectories of person ground truth boxes,
where the increase of the scale is clearly visible. In such a scenario, the
object motion is preferably modeled in an ego-motion compensated,
vehicle-centered coordinate system on ground level. Here, however, no further
contextual cues that would enable an association between the observation and
the 3D environment are considered. Although the object type and the sensor
setup are known in this case, the constraint of tracking directly in image
space is kept. The modeling of pedestrian motion relying on physical motion
models is discussed in chapter 4.

Figure 3.8: Visualization of trajectories of person bounding boxes from the Daimler Mono Pe-
destrian dataset [Enz09].


Settings: For the following evaluation, the model set combination from section
3.2.3 is used. We also kept the previously chosen transition matrix and the
initial probability values. The fixed observation noise standard deviations for
the x and y dynamics were set to $\sqrt{0.03} \cdot s_0$ pixels, and the
adaptive observation noise levels were set to $\sqrt{0.03} \cdot s^k$ pixels.
The noise levels for the scale dynamics were modeled as white Gaussian noise.
The process noise standard deviation was set to $\sqrt{0.1} \cdot s_0$ pixels
for all three filters and modeled as the acceleration increment during a
sampling period. Since the error for the scale estimation is identical, it is
excluded from the evaluation. Thus, only the image location error
$e = \sqrt{(x_{GT} - \hat{x})^2 + (y_{GT} - \hat{y})^2}$ in the form of the
root-mean-squared error (RMSE) is considered. For simulating different person
detectors, the ground truth bounding boxes were used. Zero-mean white Gaussian
noise was added to the ground truth center location and scale. For taking the
scale dependency into account, the additional noise term was set to
$\sqrt{0.04} \cdot s^k$ pixels.


Table 3.3: RMSE analysis for the de-coupled (dec) and re-coupled (rec) IMM filter on selected
sequences of the Daimler Mono Pedestrian dataset [Enz09]. Settings:
$\sigma_v = \sqrt{0.1} \cdot s_0$ pixels, $\sigma_w = \sqrt{0.03} \cdot s_0$ pixels.


sequence  frame numbers   mean e_dec  std e_dec  mean e_rec  std e_rec  r_rec/dec  std r_rec/dec
01        2678 - 2708      4.330       4.174      3.978       2.962     0.886      0.168
02        2875 - 2900      3.425       2.580      3.350       2.079     0.964      0.115
03        4686 - 4712     11.321      16.153      9.822      11.169     0.882      0.195
04        4892 - 4921      3.668       7.003      6.432       5.615     0.923      0.128
05        5974 - 6016      6.301       6.905      5.879       5.615     0.886      0.180
06        11047 - 11076    4.679       3.922      4.623       3.795     0.986      0.110
07        11248 - 11283    9.584       7.762      9.111       7.086     0.935      0.129
08        11485 - 11521    8.436       7.381      8.136       6.882     0.951      0.111
09        11796 - 11842    5.365       5.887      4.821       4.362     0.852      0.230
10        17342 - 17366   10.711      10.555      9.594       7.578     0.906      0.165


Results: The results of the described setup for $N_r = 1000$ runs are shown in
table 3.3. Here, $\bar{e}$ is the average RMSE and $\sigma_e$ the corresponding
standard deviation. The last two columns contain the average ratio of the
summed RMSE of a sequence,

\[
r_{rec/dec} = \frac{1}{N_r} \sum_{n=1}^{N_r} \frac{\sum_k e_{k,rec}}{\sum_k e_{k,dec}},
\]

and its standard deviation. In these experiments, the adaptive noise tuning was
done step-wise by mapping it to fixed levels in order to avoid an oversensitive
tuning. Alternatively, the current uncertainty of the scale estimate can
additionally be considered for preventing an oversensitive noise adjustment
that might result in an erroneous assessment. As can be seen in table 3.3, the
results achieved with the de-coupled filter are inferior to those of the
re-coupled filter. The difference in terms of the average RMSE is small, but
the variance for the re-coupled filter is lower. The results show the benefit
of a re-coupled filter and prototypical situations for applying it. For
sequences where the change in scale is less pronounced, like sequence 06, both
filters perform equally well, but the re-coupled filter bears the risk of too
strong a noise adaptation because of the scale estimation uncertainty. This is
also consistent with modeling the detection uncertainty with an additive fixed
term, where the performance can accordingly shift towards the de-coupled IMM
filter. But in sequences with rapid scale changes, the advantage of the
re-coupled filter becomes more significant. This can be seen from the lower
$r_{rec/dec}$ values for the chosen settings.
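
The step-wise adaptation mentioned above can be sketched as a quantization of the scale estimate before it enters the observation noise. The sketch assumes a noise standard deviation proportional to the scale, in line with the error analysis of figure 3.7; the interval width `step`, the factor `rel_variance`, and the function names are placeholders chosen for this illustration, not the values or code used in the experiments.

```python
import math


def quantize_scale(scale, step=50.0):
    """Map a scale estimate to the centre of its fixed-width interval."""
    return (math.floor(scale / step) + 0.5) * step


def observation_variance(scale, rel_variance=0.03, step=None):
    """Variance of the position observation noise, (sqrt(rel_variance) * s)^2.

    With `step` set, the scale is first mapped to a fixed level so that
    small fluctuations of the scale estimate do not change the noise level.
    """
    s = scale if step is None else quantize_scale(scale, step)
    return rel_variance * s * s


# Scale range of sequence 03 (103 to 360 pixels) as an example input.
for s in (103.0, 140.0, 360.0):
    print(s, observation_variance(s), observation_variance(s, step=50.0))
```

Scales of 103 and 140 pixels fall into the same interval and therefore share one noise level, whereas the continuous variant changes with every new scale estimate.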


The results from table 3.3 mainly serve to illustrate some effects of tracking
solely in image space. But they also show some limitations of the chosen linear
filter setup for scenarios captured from a driving vehicle. However, the effect
of fixed noise levels can be derived directly from the above error distribution
analysis of the person detector. The filter gain multiplies the prior
uncertainty $P^{k|k-1} (H^k)^T$ with the inverse observation uncertainty
(residual covariance: $S^k = H^k P^{k|k-1} (H^k)^T + R^k$; see equation 3.30).
Although a division is not defined for matrices, we can think of the Kalman
gain as a ratio that controls the influence of a new observation on the updated
(posterior) state estimate. For example, in sequence 03, the ground truth scale
of the fully visible person changes from 103 pixels to 360 pixels. As
mentioned, a detection is counted as a true positive down to an IoU of 0.5.
Hence, an admissible association of a correct detection to the ground truth can
result in relatively high localization errors. An error confined to y can
amount to half the person scale, which for sequence 03 means up to 180 pixels.
Assuming that the prior uncertainty and the state estimate from the last frame
are identical, the difference in the filter gain only depends on the noise
variance. Thus, a small noise variance would strongly underestimate the
observation uncertainty and lead to an almost complete correction of the
estimated position towards the measured position, and vice versa.

Figure 3.9: Comparison between estimated trajectories of a re-coupled and a de-coupled IMM
filter and the corresponding ground truth trajectory. The noisy observation is
highlighted in lime. In the examples from sequence 03 and sequence 10, the
positive effect of adjusting the observation noise level is visible.
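
The interpretation of the gain as a ratio is easiest to see in the scalar case with an observation matrix of 1, where the gain reduces to the prior variance divided by the sum of prior and observation variance. The following sketch uses made-up numbers and hypothetical function names; it only illustrates how the assumed observation variance decides whether the estimate follows the measurement or the prediction.

```python
def scalar_gain(prior_var, obs_var):
    """Kalman gain for a scalar state observed directly (H = 1)."""
    return prior_var / (prior_var + obs_var)


def scalar_update(prior_mean, prior_var, z, obs_var):
    """Posterior mean and variance after one scalar measurement update."""
    k = scalar_gain(prior_var, obs_var)
    mean = prior_mean + k * (z - prior_mean)
    var = (1.0 - k) * prior_var
    return mean, var


# Illustrative numbers: predicted position 100 px, measured position 160 px.
prior_mean, prior_var, z = 100.0, 25.0, 160.0
for obs_var in (1.0, 400.0):
    mean, var = scalar_update(prior_mean, prior_var, z, obs_var)
    print(obs_var, round(scalar_gain(prior_var, obs_var), 3), round(mean, 2))
```

With the small observation variance, the updated position lands close to the measurement; with the large one, it stays close to the prediction. A fixed variance tuned for small scales therefore lets large-scale localization errors pass almost unfiltered.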


Although the overall performance of person detectors increases at close range,
that is, for large scales [Dol12], this is not true for the localization
accuracy (see figure 3.7). If the underlying detection and localization scheme
improves for larger scales, a fixed observation noise level and a de-coupled
filter setup are sufficient. This also implies that the commonly used IoU value
as a reference for considering a detection a true positive is only adequate for
assessing the detection task, but should be more restrictive for larger scales,
especially in combination with tracking. The combination of a filter and a
detector with very large image location uncertainties is also impractical.


In summary, when tracking solely in image space, the observation uncertainties
provided by commonly used person detectors or visual trackers are
scale-dependent. The proposed re-coupled IMM filter helps to deal with these
conditions. The advantage of the re-coupling scheme for an IMM filter is more
significant in scenarios where tracked objects cover a broader range of scales
or where their scales change rapidly (see figure 3.9).


3.3 Assets and Drawbacks of IMM Filters 


A maneuver is any motion of an object that deviates from the dynamical model
used by the filter. If the object maneuvers are covered by a finite set of
models, multiple-model approaches are the preferred choice for dealing with
such a model mismatch. However, the tuning of a filter, that is, the choice of
its design parameters, requires a large amount of engineering. Since the tuning
of filters aims to systematically connect the filter parameters to physical
system parameters, this is extremely hard for tracking directly in image space.
Thus, the filter setup consists of simplified models of the true motion mode
with large noise levels to deal with the model uncertainty. The proposed
modifications of a basic IMM filter design help to improve the filter
performance, in particular for tracking in image space.


It is clear that for most application scenarios a mapping function to 3D is
crucial to further improve the tracking performance. Nevertheless, filter
tuning still requires a large amount of engineering and adequate physical
models. To overcome this limitation of the IMM filter, we propose RNN-based
alternatives that retain the key abilities of an IMM filter. We shift from a
physics-based modeling to a pattern-based modeling of the object dynamics. The
goal is to describe complex object motion and to capture the intention behind a
switch in dynamics. The probabilities of the currently performed motion
dynamics, which are crucial for higher-level processing, should also be
provided by the system.


The assets and drawbacks of the IMM filter are summarized in table 3.4. 


Table 3.4: Summary of the assets and drawbacks of IMM filters.


Assets

+ Most common architecture to capture switching dynamics.
+ Ability to describe complex object motion and capture latent intention.
+ Probability of the motion dynamics that is currently performed.
+ Low computational complexity.


Drawbacks

- Large amount of engineering required (model set structure, system/process
  and observation noise, jump structure and transition probabilities).
- Limited expressive power.
- Physical model required.


In this chapter, the RNN-based solutions for dealing with maneuvering objects
are explored. For the exemplary tasks of path prediction and intention
prediction, their behavior with regard to model mismatch is analyzed. Path
prediction is mainly used for the comparison to related motion prediction
approaches, because a public standard benchmark exists (section 4.2.1).
Intention prediction, on the other hand, is well suited to give a detailed
evaluation of the switching behavior (section 4.2.2). The first part of this
chapter introduces some theoretical background required for the deep
learning-based filter alternatives proposed later. Parts of this chapter have
been published in [Bec18c, Bec18b, Bec19a, Bec19b].


4.1 Background 


Again, we start with the formalized prediction problem (see equation 1.1)

\[
Y = f_\theta(Z^{0:k}, C^{0:k}) + \epsilon,
\]

where Y describes the future states (or a distribution over the states) of a
trajectory, $Z^{0:k}$ are the observations generated by the tracking system,
$C^{0:k}$ are additional contextual cues extracted from the observed image
sequences, and $\epsilon$ describes an additional error term. As discussed in
section 2.3, modeling motion and modeling contextual cues are two different
aspects of the motion prediction problem. Without loss of generality, the
contextual cues $C^{0:k}$ are initially left out.

Our aim is to replace the function $f_\theta$ of a dynamical model with a deep
learning solution. For a fixed observation window, and when the function
approximation is posed as a regression problem, the mapping can be realized by
an MLP, a feed-forward neural network, trained with an appropriate distance
function for the predicted trajectory. MLPs are the quintessential deep
learning models. The term feed-forward is used because information flows
through the computational graph of the network from the input, here the
fixed-length observed trajectory $Z^{0:k}$, through the intermediate
computations used to define $f(\cdot)$, and finally to the target state y
[Goo16]. Thus, an MLP defines a mapping $f_\theta(Z^{0:k})$ and learns the
parameters $\theta$ that result in the best function approximation. MLPs have
no feedback connections that feed model outputs back into the model. If
feedback connections are included, such networks are referred to as RNNs. By
drawing the connection between dynamical systems and RNNs in advance (see
section 2.2), we motivated the transfer to a deep learning solution. MLPs are a
conceptual stepping stone on the path to RNNs, which power our proposed IMM
alternative, and are thus explained in brief.


4.1.1 Multi-Layer Perceptron 


MLPs combine several interconnected perceptrons. A perceptron is a type of elemental neural unit (neuron) which has an input vector $z \in \mathbb{R}^{n_z}$ and one scalar output $o$ [Ros58]. The output is an activation function $\phi(\cdot)$ applied to the dot product of the inputs and the weights plus a bias term $b$:

$$o = \phi(w^\top z + b). \tag{4.1}$$

Here, $w \in \mathbb{R}^{n_z}$ denotes the weights for all the inputs, and $\phi$ a (typically non-linear) function referred to as activation function. Commonly used activation functions include the identity function, the sigmoid function, the hyperbolic tangent function, the ReLU (Rectified Linear Unit) function, and the Leaky ReLU function. By combining several neurons, MLPs are able to approximate arbitrary non-linear mappings. Thus, an MLP is a feed-forward network of neuron layers where the neurons of one layer are only connected to the neurons of the previous layer. The output of the $i$th layer is given by


$$l_i = \phi(W_i l_{i-1} + b_i), \tag{4.2}$$


where $l_0 = z$ is the input and $l_N = o$ is the output of an $N$-layer MLP. $W_i \in \mathbb{R}^{n_m \times n_z}$ is the weight matrix of layer $i$ with $W_i = [w_{i,1}, \ldots, w_{i,m}]$ being the combined weight vectors of its $m$ neurons. All weight matrices $W$ and bias vectors $b$ are the trainable parameters $\Theta$ of the network controlling the behavior of $f_\Theta$. Figure 4.1 illustrates the structure of an MLP consisting of an input layer, an output layer, and one hidden layer. Due to the pairwise connections of neurons between layers, the layers of an MLP are also referred to as fully connected layers. The number of parameters quickly rises with an increasing number of neurons in the network. Although the term MLP is sometimes strictly used for a class of feed-forward networks composed of multiple layers of perceptrons with threshold activation, here the term MLP refers loosely to any feed-forward network without being restricted to a particular activation function, including radial basis function networks [Bro88].

Figure 4.1: Visualization of an MLP.

The parameters of feed-forward neural networks can be efficiently learned with stochastic gradient descent (SGD) together with the backpropagation algorithm [Rum88]. Backpropagation is an efficient technique for computing the gradients in directed graphs of computations, such as neural networks. The term backpropagation is short for backpropagation of errors, where the errors are defined by the distance function, such as the distance to the path being predicted. Further details on backpropagation and optimization methods are given in section 4.1.3. An MLP or fully connected feed-forward network is designed to have separate parameters for each input feature so that it learns all of the rules of the object motion separately at each position in the trajectory.
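
As a minimal sketch of equations 4.1 and 4.2, the forward pass of a small MLP can be written in plain Python. The layer sizes, the weight values, and the helper names `dense` and `mlp_forward` are illustrative assumptions and not taken from the experiments of this thesis. The toy network maps two observed 2D positions (four inputs) to one predicted position.

```python
def dense(weights, bias, inputs, activation):
    """One layer l_i = phi(W_i l_{i-1} + b_i); weights is a list of rows."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, bias)]

def mlp_forward(layers, z):
    """Apply all layers in order, starting from l_0 = z."""
    l = z
    for weights, bias, activation in layers:
        l = dense(weights, bias, l, activation)
    return l

def relu(a):
    return max(0.0, a)

def identity(a):
    return a

# Hypothetical toy network: 4 inputs, 3 hidden ReLU neurons, 2 linear outputs.
layers = [
    ([[0.5, 0.0, -0.5, 0.0],
      [0.0, 0.5, 0.0, -0.5],
      [0.1, 0.1, 0.1, 0.1]], [0.0, 0.0, 0.1], relu),
    ([[1.0, 0.0, 1.0],
      [0.0, 1.0, 1.0]], [0.0, 0.0], identity),
]

z = [1.0, 2.0, 0.0, 1.0]   # fixed-length observation window, flattened
o = mlp_forward(layers, z)

# Number of trainable parameters: all weight matrix entries plus all biases.
n_params = sum(len(W) * len(W[0]) + len(b) for W, b, _ in layers)
```

Because every input position has its own column in the first weight matrix, the input length is fixed by construction, which is the limitation that motivates the recurrent formulation.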


4.1.2 Recurrent Neural Networks 


In order to extend MLPs for an improved processing of sequential data with 
variable input length, RNNs share parameters across different parts of a 
model. As explained in section 2.2, RNNs are extensions of MLPs, where
hidden units $H = \{h^k : k \in \mathbb{N}\}$ are used to encode an internal latent state
space. Unfolding a recurrent computation into a computational graph that
has a repetitive structure results in parameter sharing across the network. 
The unfolded model structure corresponds, similarly to recursive Bayesian 
filtering, to a directed acyclic computational graph. Thus, the recurrent 
network processes information by incorporating it into the hidden state that 
is passed forward in time. As shown in equation 2.7, the hidden state for one 
time step can be given by 


$$h^{k+1} = f_\Theta(h^k, z^{k+1}). \tag{4.3}$$

A basic RNN [Elm90] can be defined as

$$h^{k+1} = \phi(W_{hh} h^k + W_{zh} z^{k+1} + b_h), \tag{4.4}$$
$$o^k = \phi(W_{ho} h^k + b_o). \tag{4.5}$$


$W_{(\cdot)}$ represents the weights and $b_{(\cdot)}$ the biases of a recurrent layer, and $\phi(\cdot)$ an activation function. The unfolding process introduces two advantages. Firstly, regardless of the trajectory length, the input size is kept fixed. Secondly, it is possible to use the same transition function with the same parameters for every step. A single shared model allows generalization to trajectory lengths not included in the training set.
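
The two advantages of unfolding can be illustrated with a minimal sketch of the basic RNN of equations 4.4 and 4.5. The parameter values and the function names are assumptions made for illustration; the hyperbolic tangent is assumed for the hidden state and the identity for the output. The same parameters are applied at every step, so trajectories of different lengths are processed by one model.

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rnn_step(params, h, z):
    """h^{k+1} = tanh(W_hh h^k + W_zh z^{k+1} + b_h), cf. equation 4.4."""
    pre = [a + b + c for a, b, c in zip(matvec(params["W_hh"], h),
                                        matvec(params["W_zh"], z),
                                        params["b_h"])]
    return [math.tanh(p) for p in pre]

def rnn_output(params, h):
    """o^k = W_ho h^k + b_o with an identity activation, cf. equation 4.5."""
    return [a + b for a, b in zip(matvec(params["W_ho"], h), params["b_o"])]

def run_rnn(params, trajectory):
    """Process a trajectory of arbitrary length with shared parameters."""
    h = [0.0] * len(params["b_h"])
    outputs = []
    for z in trajectory:
        h = rnn_step(params, h, z)
        outputs.append(rnn_output(params, h))
    return outputs

# Hypothetical parameters: 2D input, 3 hidden units, 2D output.
params = {
    "W_hh": [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]],
    "W_zh": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "b_h": [0.0, 0.0, 0.0],
    "W_ho": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "b_o": [0.0, 0.0],
}

short_track = [[0.1 * k, 0.05 * k] for k in range(5)]
long_track = short_track + [[0.1 * k, 0.05 * k] for k in range(5, 8)]

out_short = run_rnn(params, short_track)
out_long = run_rnn(params, long_track)
```

Since the hidden state only depends on past inputs, the first five outputs of the longer trajectory coincide with the outputs of the shorter one.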


The ideas of graph unrolling and parameter sharing enable the design of a wide variety of RNNs [Goo16]. The most effective sequence models used in practical applications are the so-called gated RNNs. These include networks based on the long short-term memory (LSTM) [Hoc97] and on the gated recurrent unit (GRU) [Cho14]. Together with the standard RNN, these variants are used in most of our experiments. The aim of gated RNNs is to reduce the effects of exploding and vanishing gradients during parameter learning [Ben93, Pas13]. They rely on the idea of creating paths through time along which gradients neither vanish nor explode, using connection weights that may change at every time step. In [Gre17], Greff et al. conducted a comparative study between different variants of gated RNN architectures for the tasks of speech recognition, handwriting recognition, and polyphonic music modeling. The results show that none of the variants could significantly improve upon the standard LSTM.


Figure 4.2: Visualization of a standard RNN unit.

Figure 4.3: Visualization of an LSTM unit.

Figure 4.4: Visualization of a GRU unit.

The computation of RNNs can be divided into three main blocks of parameters and corresponding transformations: firstly, from the input to the hidden state; secondly, from the previous to the next hidden state; and thirdly, from the hidden state to the output. All RNNs have the form of a chain of repeated neural network modules called units. An RNN unit includes the main blocks and, depending on the chosen solution to the vanishing gradient problem, additional interacting layers or gates. Gates are realized as linear layers with an activation function and an element-wise operation with the signal. LSTMs have an additional internal recurrence, called cell state or memory cell $c$, in addition to the outer recurrence of the RNN. The introduced self-loop adds a path for the gradients. In addition, trainable gates control the information added to and removed from the cell state. The transition equations of an LSTM are given by


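$$g_i^k = \mathrm{sigm}(W_{hg_i} h^{k-1} + W_{zg_i} z^k + b_{g_i}),$$
$$g_f^k = \mathrm{sigm}(W_{hg_f} h^{k-1} + W_{zg_f} z^k + b_{g_f}),$$
$$g_o^k = \mathrm{sigm}(W_{hg_o} h^{k-1} + W_{zg_o} z^k + b_{g_o}),$$
$$c^k = g_f^k \odot c^{k-1} + g_i^k \odot \tanh(W_{hc} h^{k-1} + W_{zc} z^k + b_c), \tag{4.6}$$
$$h^k = g_o^k \odot \tanh(c^k), \tag{4.7}$$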


with the input gate vector $g_i^k$, forget gate vector $g_f^k$, and output gate vector $g_o^k$. The operator $\odot$ denotes the Hadamard product (element-wise product). The forget gate controls how much of the old cell value is used in the new cell value. The amount of input used in the new cell value is controlled by the input gate, and the output gate controls how much of the new cell value is put out. In a GRU, the LSTM cell is simplified by combining the forget and input gates into a single update gate and merging the cell and hidden states. This results in fewer trainable parameters with comparable performance in specific tasks.

The GRU architecture is given by


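$$g_u^k = \mathrm{sigm}(W_{hg_u} h^{k-1} + W_{zg_u} z^k + b_{g_u}),$$
$$g_r^k = \mathrm{sigm}(W_{hg_r} h^{k-1} + W_{zg_r} z^k + b_{g_r}),$$
$$\tilde{h}^k = \tanh(W_{zh} z^k + W_{hh}(g_r^k \odot h^{k-1}) + b_h),$$
$$h^k = (1 - g_u^k) \odot h^{k-1} + g_u^k \odot \tilde{h}^k, \tag{4.8}$$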


where $g_u^k$ is the update gate vector and $g_r^k$ is the reset gate vector. The GRU tackles the vanishing gradient problem without depending on an internal cell state. The reset gate controls how much of the old hidden state enters the candidate state, and the update gate controls how the old hidden state and the candidate state are blended into the new hidden state. For more information on different variants of RNNs, we refer to the following works [Gra13b, Gre17, Goo16]. The block diagrams of the repeating units of an RNN, an LSTM, and a GRU are illustrated in figures 4.2, 4.3, and 4.4.
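
The gate mechanics of equations 4.6 to 4.8 can be sketched as single-step transition functions. This is an illustration under assumptions: the parameter layout (one block of hidden weights, input weights, and bias per gate) and all names are hypothetical, and all parameters are set to zero so that the effect of the gates is easy to follow by hand.

```python
import math

def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(block, h, z):
    """Compute W_h h + W_z z + b for one parameter block (W_h, W_z, b)."""
    W_h, W_z, b = block
    return [sum(w * v for w, v in zip(row_h, h))
            + sum(w * v for w, v in zip(row_z, z)) + b_i
            for row_h, row_z, b_i in zip(W_h, W_z, b)]

def lstm_step(p, h_prev, c_prev, z):
    """One LSTM transition following equations 4.6 and 4.7."""
    g_i = [sigm(a) for a in affine(p["g_i"], h_prev, z)]   # input gate
    g_f = [sigm(a) for a in affine(p["g_f"], h_prev, z)]   # forget gate
    g_o = [sigm(a) for a in affine(p["g_o"], h_prev, z)]   # output gate
    cand = [math.tanh(a) for a in affine(p["c"], h_prev, z)]
    c = [f * c0 + i * s for f, c0, i, s in zip(g_f, c_prev, g_i, cand)]
    h = [o * math.tanh(ck) for o, ck in zip(g_o, c)]
    return h, c

def gru_step(p, h_prev, z):
    """One GRU transition following equation 4.8."""
    g_u = [sigm(a) for a in affine(p["g_u"], h_prev, z)]   # update gate
    g_r = [sigm(a) for a in affine(p["g_r"], h_prev, z)]   # reset gate
    reset_h = [r * hp for r, hp in zip(g_r, h_prev)]
    cand = [math.tanh(a) for a in affine(p["h"], reset_h, z)]
    return [(1.0 - u) * hp + u * s for u, hp, s in zip(g_u, h_prev, cand)]

def zero_block(n_h, n_z):
    """Hypothetical all-zero parameter block, used only for this example."""
    return ([[0.0] * n_h for _ in range(n_h)],
            [[0.0] * n_z for _ in range(n_h)],
            [0.0] * n_h)

n_h, n_z = 2, 2
lstm_params = {name: zero_block(n_h, n_z) for name in ("g_i", "g_f", "g_o", "c")}
gru_params = {name: zero_block(n_h, n_z) for name in ("g_u", "g_r", "h")}

h_lstm, c_lstm = lstm_step(lstm_params, [0.0, 0.0], [1.0, -1.0], [0.3, 0.7])
h_gru = gru_step(gru_params, [1.0, -1.0], [0.3, 0.7])
```

With zero parameters every gate evaluates to 0.5 and every candidate to 0, so the LSTM halves its old cell state and the GRU halves its old hidden state.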


4.1.3 Training


As shown in section 2.2, the unfolded structure of an RNN corresponds to a directed acyclic computational graph. The graph maps the input trajectory $Z = \{z^k : k \in \mathbb{N}\}$ (or generally an input sequence) to a corresponding sequence of outputs $O = \{o^k : k \in \mathbb{N}\}$. Thus, it provides an explicit description of which computations to perform and maps the input trajectory to outputs and losses. The outputs and losses are computed by information flowing forward in time, whereas the parameter gradients are computed backward in time. Figure 2.4 depicts the unfolded computational graph of a vanilla RNN. Since RNNs are composed of differentiable operators, the parameters can be trained by minimizing any differentiable loss function $\mathcal{L}(\Theta)$ using gradient descent. The basic update cycle of gradient descent is to find the derivative of the loss function with respect to the network parameters $\Theta$, and then to adjust the weights in the direction of the negative slope (parameter update). In section 2.2, we demonstrated that the loss function of RNNs corresponds to maximum likelihood estimation with deterministic dynamics. Further, the loss function corresponding to the probability assigned by the network to the observed sequence $Z = \{z^k : k \in \mathbb{N}\}$ is given by

$$\mathcal{L}(\Theta)_{\mathrm{RNN}} = -\sum_k \log p(z^k \mid o^{k-1}). \tag{4.9}$$

Footnote: The visualizations in figures 4.2, 4.3, and 4.4 are inspired by Understanding LSTM Networks (https://colah.github.io/posts/2015-08-Understanding-LSTMs/, last accessed 19.12.2019).


The choice of a loss function is directly related to the design of the output layer. One can think of the design of the output layer as framing the prediction problem, while the choice of the loss function corresponds to the way of calculating the error. Thus, the output $o^k$ can be used to parameterize a predictive distribution $p(z^{k+1} \mid o^k)$ over the possible next observation $z^{k+1}$. Due to the deterministic nature of RNNs, the computation of the predictive distributions is realized by the feed-forward operations in the unfolded network. Accordingly, RNNs can be trained similarly to feed-forward networks by applying gradient descent methods to minimize a differentiable loss function $\mathcal{L}(\Theta)$.


The generalization of backpropagation for recurrent networks is called backpropagation through time (BPTT) [Wil95]. As explained briefly, backpropagation is a technique to efficiently calculate the gradients of scalar-valued functions with respect to their inputs. It boils down to a recursive application of the chain rule from calculus for the partial derivatives. In order to perform a parameter update, the gradients of the loss function $\nabla_{\Theta^k} \mathcal{L}(\Theta)$ with respect to the parameters are required. The same process enables a computation of the gradients for the inputs. By applying the chain rule, evaluating the gradient of the output with respect to the inputs reduces to a product of Jacobian matrices. In the forward pass, all intermediate values, corresponding to all intermediate transformations, and the loss for a given set of training samples and the current parameters are computed. Thus, inputs and loss function take on specific values using fixed functions. The backward pass, backpropagation, proceeds in the reverse order through all intermediate stages by applying the chain rule to estimate the influence of local gradients on the final output. Thus, the gradients are recursively chained through the functions that produced the values in the forward pass until the inputs are reached. For neural networks, the inputs of interest are the network parameters, and their gradients provide information on how to change the current parameters for minimizing the expected loss [Kar16b].
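
The forward and backward pass described above can be made concrete with a minimal sketch of BPTT for a scalar RNN. The model $h^{k+1} = \tanh(w h^k + u z^{k+1})$, the squared error on the final hidden state, and all numerical values are assumptions chosen for illustration. The analytic gradients are compared against central finite differences.

```python
import math

def forward(w, u, zs):
    """Forward pass: store all hidden states h^0, ..., h^K."""
    hs = [0.0]
    for z in zs:
        hs.append(math.tanh(w * hs[-1] + u * z))
    return hs

def loss(w, u, zs, target):
    """Squared error between the final hidden state and a target value."""
    return 0.5 * (forward(w, u, zs)[-1] - target) ** 2

def bptt(w, u, zs, target):
    """Backward pass: chain local gradients from the loss back through time."""
    hs = forward(w, u, zs)
    dh = hs[-1] - target                 # dL/dh^K
    dw = du = 0.0
    for k in range(len(zs), 0, -1):
        da = dh * (1.0 - hs[k] ** 2)     # through tanh at step k
        dw += da * hs[k - 1]             # shared parameter: gradients add up
        du += da * zs[k - 1]
        dh = da * w                      # pass gradient on to h^{k-1}
    return dw, du

w, u, target = 0.8, 0.5, 0.3             # hypothetical values
zs = [0.2, -0.1, 0.4, 0.3]

dw, du = bptt(w, u, zs, target)

eps = 1e-6
dw_num = (loss(w + eps, u, zs, target) - loss(w - eps, u, zs, target)) / (2 * eps)
du_num = (loss(w, u + eps, zs, target) - loss(w, u - eps, zs, target)) / (2 * eps)
```

Because the parameters are shared across time steps, the contributions of all steps are accumulated, which is the only difference to backpropagation in a feed-forward network.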


The basic update rule of gradient descent as an iterative optimization algo- 
rithm can be written as 


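$$\Theta^{k+1} = \Theta^k - \lambda_{lr} \frac{\partial \mathcal{L}(\Theta)}{\partial \Theta^k} = \Theta^k - \lambda_{lr} \nabla_{\Theta^k} \mathcal{L}(\Theta), \tag{4.10}$$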
where $\lambda_{lr}$ is the learning rate that determines the magnitude of the parameter change $\Delta\Theta^k$. The index $k$ refers to the time before and $k+1$ to the time after the parameter update. Designing and training a neural network is not much different from training any other machine learning model with gradient descent. In contrast to standard gradient descent, where the gradient is calculated from the entire training dataset, stochastic gradient descent (SGD) performs a parameter update for a randomly selected subset of training samples. Sometimes, a minor distinction is made in the terminology: SGD then denotes a parameter update for every single data sample, and mini-batch gradient descent a parameter update for every mini-batch of training samples. Since most algorithms for deep learning use more than one but fewer than all training samples, here, for the sake of simplicity, only the term SGD is used [Goo16].
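
The update rule of equation 4.10 in its mini-batch form can be sketched with a deliberately small example. The task (estimating a constant velocity from noise-free positions), the function name, and all hyper-parameter values are assumptions made for illustration.

```python
import random

def sgd_fit_velocity(ts, xs, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Fit x = v * t by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    v = 0.0
    indices = list(range(len(ts)))
    for _ in range(epochs):
        rng.shuffle(indices)                       # random subsets per epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of 0.5 * mean((v * t - x)^2) with respect to v.
            grad = sum((v * ts[i] - xs[i]) * ts[i] for i in batch) / len(batch)
            v -= lr * grad                         # parameter update
    return v

ts = [0.1 * i for i in range(20)]                  # time stamps
xs = [1.5 * t for t in ts]                         # positions, true velocity 1.5
v_hat = sgd_fit_velocity(ts, xs)
```

Each update uses only four of the twenty samples, so the gradient is a noisy estimate of the full-batch gradient, yet the estimate still converges to the true velocity in this noise-free setting.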


Besides SGD as a stochastic approximation of standard gradient descent, there exist many further modifications to overcome challenges of gradient descent-based optimization strategies such as over-fitting, slow convergence, and local minima. Since second-order optimization methods, such as Newton's method, require the calculation of the Hessian matrix, which is expensive for networks with a large number of parameters, in practice almost exclusively first-order algorithms based on gradient descent are used. In the following, we briefly outline commonly used modifications, but leave out second-order alternatives.


The momentum method [Pol64] is designed to accelerate SGD, especially in the face of high curvature, small but consistent gradients, or noisy gradients. The momentum method accumulates an exponentially decaying moving average of past gradients and continues to move in their direction. Thereby, momentum damps oscillations and speeds up convergence in cases where the surface of the loss function is much steeper in one dimension than in the other.

Formally, the momentum method changes the parameter update by introducing a velocity term that captures the direction and speed at which the parameters move through parameter space. The adapted parameter update is given by


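$$\Theta^{k+1} = \Theta^k - v_\Theta^k, \quad \text{with}$$
$$v_\Theta^k = \lambda_{mo} v_\Theta^{k-1} + \lambda_{lr} \nabla_{\Theta^k} \mathcal{L}(\Theta). \tag{4.11}$$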


The added hyper-parameter $\lambda_{mo} \in [0, 1]$ is termed momentum and controls the influence of previous update values. Common values of $\lambda_{mo}$ used in practice include 0.5, 0.9, and 0.99. Alternatively, $\lambda_{mo}$ is adapted over time by starting from a small value and later increasing it. Rumelhart et al. [Rum86] showed that using a momentum term dramatically increases the convergence rate.


Nesterov's accelerated gradient [Nes83] adds a correction factor to the stan- 
dard momentum method by evaluating the gradient after a one step prediction 
in the current direction. This update rule can be expressed as follows 


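$$\Theta^{k+1} = \Theta^k - v_\Theta^k, \quad \text{with}$$
$$v_\Theta^k = \lambda_{mo} v_\Theta^{k-1} + \lambda_{lr} \nabla_{\Theta^k} \mathcal{L}(\Theta - \lambda_{mo} v_\Theta^{k-1}). \tag{4.12}$$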
This minor change intends to slow down earlier and reduce overshooting. In [Ben13], Bengio et al. showed that the resulting increased responsiveness helps to improve the performance of RNNs for a number of tasks. The momentum term helps to speed up SGD and adapts the parameter update with respect to the slope of the error function, but at the expense of introducing another hyper-parameter. One alternative is to adapt the learning rate by applying a pre-defined schedule. A representative function is the exponential learning rate decay. Beyond such global adaptations, it is also possible to allow for individual parameter updates depending on the importance of each parameter.
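
The momentum update and its Nesterov variant can be sketched on a toy loss that is much steeper in one dimension than in the other. The quadratic loss, the function names, and the hyper-parameter values are assumptions for illustration; the Nesterov branch evaluates the gradient at the look-ahead position $\Theta - \lambda_{mo} v_\Theta^{k-1}$.

```python
def grad(theta):
    """Gradient of the toy loss 0.5 * (x^2 + 25 * y^2)."""
    return [theta[0], 25.0 * theta[1]]

def momentum_descent(theta, lr=0.03, mo=0.9, steps=300, nesterov=False):
    """Gradient descent with a velocity term (classical or Nesterov)."""
    v = [0.0, 0.0]
    for _ in range(steps):
        if nesterov:
            # Evaluate the gradient after a one-step prediction.
            look_ahead = [t - mo * vi for t, vi in zip(theta, v)]
            g = grad(look_ahead)
        else:
            g = grad(theta)
        v = [mo * vi + lr * gi for vi, gi in zip(v, g)]
        theta = [t - vi for t, vi in zip(theta, v)]
    return theta

start = [1.0, 1.0]
theta_momentum = momentum_descent(start)
theta_nesterov = momentum_descent(start, nesterov=True)

norm_momentum = sum(t * t for t in theta_momentum) ** 0.5
norm_nesterov = sum(t * t for t in theta_nesterov) ** 0.5
```

Both variants reach the minimum at the origin; the only difference between them is the point at which the gradient is evaluated.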


A heuristic approach to adapting individual learning rates for model parameters during training is the delta-bar-delta algorithm [Jac88], but it is not applicable to SGD optimization because it relies on full-batch gradients. Examples of algorithms with a per-parameter learning rate that perform more informative gradient-based learning are AdaGrad [Duc11], RMSprop [Tie12], Adadelta [Zei12], and ADAM [Kin15].


The AdaGrad algorithm adjusts the learning rates individually by scaling them inversely proportional to the square root of the sum of all past squared gradients. AdaGrad performs well for some deep learning models, but its main weakness is that the accumulation of squared gradients from the beginning of training can result in an early and aggressive decrease in the effective learning rate [Goo16].


RMSprop and Adadelta are both extensions that help to reduce the effect of an aggressively decreasing learning rate. In contrast to AdaGrad, the accumulation of all past squared gradients is replaced by an exponentially weighted moving average of the squared gradients. RMSprop is actually a special case of Adadelta with a decay factor of $\lambda_{ada} = 0.9$ for the squared gradients. The parameter update can be described by


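$$\Theta^{k+1} = \Theta^k - \lambda_{lr} \frac{\nabla_{\Theta^k} \mathcal{L}(\Theta)}{\sqrt{E[\nabla_{\Theta^k} \mathcal{L}(\Theta) \odot \nabla_{\Theta^k} \mathcal{L}(\Theta)] + \epsilon_{ada}}}, \quad \text{with}$$
$$E[\nabla_{\Theta^k} \mathcal{L}(\Theta) \odot \nabla_{\Theta^k} \mathcal{L}(\Theta)] = \lambda_{ada} E[\nabla_{\Theta^{k-1}} \mathcal{L}(\Theta) \odot \nabla_{\Theta^{k-1}} \mathcal{L}(\Theta)] + (1 - \lambda_{ada}) \nabla_{\Theta^k} \mathcal{L}(\Theta) \odot \nabla_{\Theta^k} \mathcal{L}(\Theta). \tag{4.13}$$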


Here, $\epsilon_{ada}$ is a smoothing term to avoid division by zero. The learning rate is divided by the square root of an exponentially decaying average of squared gradients. Tieleman and Hinton [Tie12] suggest setting the default value of the learning rate to 0.001.
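
The difference between accumulating all squared gradients (AdaGrad) and averaging them with exponential decay (RMSprop, equation 4.13) can be sketched with a constant gradient. The function names, the constant gradient, and the AdaGrad learning rate of 0.1 are assumptions for illustration; the sketch records the size of each parameter step.

```python
import math

def adagrad_step(theta, grad, acc, lr=0.1, eps=1e-8):
    """AdaGrad: scale by the root of the sum of all past squared gradients."""
    acc = [a + g * g for a, g in zip(acc, grad)]
    theta = [t - lr * g / (math.sqrt(a) + eps)
             for t, g, a in zip(theta, grad, acc)]
    return theta, acc

def rmsprop_step(theta, grad, avg, lr=0.001, decay=0.9, eps=1e-8):
    """RMSprop: scale by the root of a decaying average of squared gradients."""
    avg = [decay * a + (1.0 - decay) * g * g for a, g in zip(avg, grad)]
    theta = [t - lr * g / math.sqrt(a + eps)
             for t, g, a in zip(theta, grad, avg)]
    return theta, avg

constant_grad = [1.0, 0.01]     # two parameters with very different slopes

theta, acc = [0.0, 0.0], [0.0, 0.0]
adagrad_steps = []
for _ in range(100):
    new_theta, acc = adagrad_step(theta, constant_grad, acc)
    adagrad_steps.append([abs(n - t) for n, t in zip(new_theta, theta)])
    theta = new_theta

theta, avg = [0.0, 0.0], [0.0, 0.0]
rmsprop_steps = []
for _ in range(200):
    new_theta, avg = rmsprop_step(theta, constant_grad, avg)
    rmsprop_steps.append([abs(n - t) for n, t in zip(new_theta, theta)])
    theta = new_theta
```

For a constant gradient, the AdaGrad step shrinks with the inverse square root of the number of updates, whereas the RMSprop step settles close to the learning rate for both parameters, independent of the gradient magnitude.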


Another optimization method that computes adaptive learning rates for each parameter is adaptive moment estimation (ADAM) [Kin15]. ADAM combines Adadelta and momentum by storing both an exponentially decaying average of past squared gradients, like Adadelta, and an exponentially decaying average of past gradients, like momentum. These are estimates of the second and the first moment of the gradients, respectively. Further, ADAM counteracts biases during initialization in the first-order moments (momentum) and the (uncentered) second-order moments by correction factors. With $\lambda_{bc,1}$, $\lambda_{bc,2}$ being the bias correction factors for the first and second moment estimates, the ADAM update rule is given by


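$$\Theta^{k+1} = \Theta^k - \lambda_{lr} \frac{1}{\sqrt{\frac{E[\nabla_{\Theta^k} \mathcal{L}(\Theta) \odot \nabla_{\Theta^k} \mathcal{L}(\Theta)]}{1 - \lambda_{bc,2}}} + \epsilon_{ADAM}} \, \frac{E[\nabla_{\Theta^k} \mathcal{L}(\Theta)]}{1 - \lambda_{bc,1}}. \tag{4.14}$$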


Kingma and Ba suggest using $\lambda_{lr} = 0.001$, $\lambda_{bc,1} = 0.9$, and $\lambda_{bc,2} = 0.999$ as default parameters. Adadelta and RMSprop also incorporate an estimate of the second-order moment but lack the correction factor. Thus, their estimates may have high biases in early stages of the training. In general, ADAM is considered to be relatively robust to the choice of hyper-parameters. Nevertheless, no single best algorithm has emerged. In [Sch14], Schaul et al. conducted a comparative study on a large number of optimization algorithms across a wide range of learning tasks. The results show that the presented algorithms with adaptive learning rates perform fairly robustly, but without a clear-cut best algorithm. Despite some further extensions, such as Nadam [Doz16] and AMSGrad [Red18], the most actively used optimization algorithms are standard SGD, SGD with momentum, RMSprop, RMSprop with momentum, Adadelta, and ADAM. ADAM and its extensions might be the best overall choices, since SGD is much more reliant on a robust initialization and annealing schedule than other optimizers [Rud16].
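
A minimal sketch of an ADAM step is given below. It follows the standard formulation of Kingma and Ba, in which the bias correction divides by one minus the decay rate raised to the power of the update count; the names `beta1` and `beta2` are assumptions that correspond to $\lambda_{bc,1}$ and $\lambda_{bc,2}$, and the toy loss and the learning rate of 0.01 are chosen for illustration only.

```python
import math

def adam_step(theta, grad, m, v, k, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update; k is the 1-based update count."""
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]      # 1st moment
    v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]  # 2nd moment
    m_hat = [mi / (1.0 - beta1 ** k) for mi in m]    # bias-corrected estimates
    v_hat = [vi / (1.0 - beta2 ** k) for vi in v]
    theta = [t - lr * mh / (math.sqrt(vh) + eps)
             for t, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

def toy_loss(theta):
    return 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)

def toy_grad(theta):
    return [theta[0], 25.0 * theta[1]]

start = [1.0, 1.0]
theta, m, v = list(start), [0.0, 0.0], [0.0, 0.0]
first_step = None
for k in range(1, 501):
    new_theta, m, v = adam_step(theta, toy_grad(theta), m, v, k, lr=0.01)
    if k == 1:
        first_step = [t - n for t, n in zip(theta, new_theta)]
    theta = new_theta

final_loss = toy_loss(theta)
```

Due to the bias correction, the very first update has a magnitude close to the learning rate for every parameter, regardless of the scale of its gradient. Without the correction, the zero-initialized moment estimates would bias the first steps.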


4.1.4 Mixture Density Networks 


So far, we introduced some theoretical background related to the recurrent units and methods for training the networks. One remaining central question is how to parametrize the predictive distribution $p(z^k \mid o^{k-1})$. Similar to Bayesian filtering, the conditional probability density should allow the prediction of observations and has to deal with real-valued inputs from the observed trajectory. The idea of mixture density networks (MDNs) [Bis94, Bis06] is to use the outputs of a neural network to parameterize a Gaussian mixture distribution. MDNs provide a complete framework for modeling conditional density functions. They overcome the limitations of conventional least-squares approaches in dealing with multi-modal target data. Since the selected tasks of path and intention prediction are multi-modal problems by nature and MDNs can be used with RNNs [Sch00], our basic neural network to infer object states consists of an RNN with an MDN on top. This combination was originally introduced by Graves for the generation and prediction of handwriting [Gra13a]. It is subsequently referred to as an RNN-MDN model or, for an LSTM, as an LSTM-MDN model.


Consider a path prediction problem where $y^{k+1} = p(z^{k+1} \mid z^{0:k})$ describes the distribution of the next location and $z^{0:k}$ are the corresponding observations leading to output $o^k$. The conditional probability for a bi-variate Gaussian mixture is defined as follows:

$$p(z^{k+1} \mid o^k) = \sum_{l=1}^{L} w_l^k(o^k) \, \mathcal{N}(z^{k+1}; \mu_l^k(o^k), \sigma_l^k(o^k), \rho_l^k(o^k)). \tag{4.15}$$

Here, $o^k$ is used to parametrize the $L$-component Gaussian mixture model, $o^k = \{\hat{\mu}_l^k, \hat{\sigma}_l^k, \hat{\rho}_l^k, \hat{w}_l^k\}_{l=1}^{L}$. In order to reduce notational clutter, the dependency on the network output will only be denoted in case it is not clear. In an MDN, the same function is used to predict the parameters of all of the density components as well as the mixing coefficients. So, the non-linear hidden units are shared amongst the input-dependent functions. For a mixture of bi-variate Gaussians with $L$ components, the network output generates $6 \cdot L$ parameters, where mean and standard deviation are two-dimensional vectors, whereas the mixture weights and correlations are scalars.

In order to ensure that the mixture weights form a valid categorical distribution, they are normalized with a softmax activation function:

$$w_l^k = \frac{\exp(\hat{w}_l^k)}{\sum_{l'=1}^{L} \exp(\hat{w}_{l'}^k)}. \tag{4.16}$$

The softmax function ensures that the weights lie in the required range $w_l^k \in (0, 1)$ and that $\sum_{l=1}^{L} w_l^k = 1$. It generalizes the logistic sigmoid, which parameterizes a Bernoulli distribution, to a categorical distribution. The variances can be represented in terms of the exponential of the corresponding network outputs [Bis94, Gra13a]. The originally proposed exponential can lead to numerical instability, thus we employ a variant of the exponential linear unit (ELU) activation function [Cle16]. The transformation function is given by

$$\sigma_l^k = \begin{cases} \hat{\sigma}_l^k + 1 & \text{for } \hat{\sigma}_l^k \geq 0, \\ \lambda_{ELU} (\exp(\hat{\sigma}_l^k) - 1) + 1 & \text{for } \hat{\sigma}_l^k < 0. \end{cases} \tag{4.17}$$

Both the originally proposed exponential and the ELU variant avoid configurations with variances that go to zero. As an alternative, a softplus activation function can also be used [Glo11]. The means represent location parameters and can be represented directly by the network output as

$$\mu_l^k = \hat{\mu}_l^k. \tag{4.18}$$


Ensuring that the correlation coefficients lie in the range $\rho_l^k \in (-1, 1)$ is realized with a hyperbolic tangent activation function. The parameters of the MDN can be determined by maximum likelihood estimation, or equivalently by minimizing an error function defined to be the negative logarithm of the likelihood.

Substituting equation 4.15 into 4.9 yields the sequence loss

$$\mathcal{L}(Z, \Theta)_{\mathrm{RNN}} = \sum_{k=1}^{K} -\log\left(\sum_{l=1}^{L} w_l^k \, \mathcal{N}(z^{k+1}; \mu_l^k, \sigma_l^k, \rho_l^k)\right),$$

with $\mathcal{N}(\cdot)$ denoting the bi-variate Gaussian density of mixture component $l$.

As a technical detail, the loss function is arranged such that the log-sum-exp trick [Pre07] can be applied, which improves numerical stability.
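
The output transformations and the log-sum-exp arrangement of the loss can be sketched for a single time step as follows. This is an illustration under assumptions: the component density is the standard bi-variate Gaussian parameterization as used by Graves [Gra13a], the ordering of the raw network outputs per component and all function names are hypothetical, and the ELU scale is set to one.

```python
import math

def log_softmax(a):
    """Logarithm of the softmax weights (equation 4.16), computed stably."""
    m = max(a)
    lse = m + math.log(sum(math.exp(x - m) for x in a))
    return [x - lse for x in a]

def elu_plus_one(a, scale=1.0):
    """ELU variant of equation 4.17; positive as long as scale <= 1."""
    return a + 1.0 if a >= 0.0 else scale * (math.exp(a) - 1.0) + 1.0

def mdn_log_terms(raw, target):
    """raw: L tuples (mu_x, mu_y, s_x, s_y, rho, w) of unconstrained outputs."""
    log_w = log_softmax([r[5] for r in raw])
    terms = []
    for (mu_x, mu_y, s_x, s_y, rho, _), lw in zip(raw, log_w):
        sx, sy = elu_plus_one(s_x), elu_plus_one(s_y)
        r = math.tanh(rho)                        # correlation in (-1, 1)
        dx, dy = (target[0] - mu_x) / sx, (target[1] - mu_y) / sy
        q = (dx * dx - 2.0 * r * dx * dy + dy * dy) / (1.0 - r * r)
        log_density = -0.5 * q - math.log(
            2.0 * math.pi * sx * sy * math.sqrt(1.0 - r * r))
        terms.append(lw + log_density)            # log(w_l * N_l)
    return terms

def mdn_nll(raw, target):
    """Negative log-likelihood of one target using the log-sum-exp trick."""
    terms = mdn_log_terms(raw, target)
    m = max(terms)
    return -(m + math.log(sum(math.exp(t - m) for t in terms)))

one_component = [(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)]
two_components = one_component * 2

nll_one = mdn_nll(one_component, (0.0, 0.0))
nll_two = mdn_nll(two_components, (0.0, 0.0))
nll_far = mdn_nll(two_components, (1000.0, 1000.0))
```

Working with the logarithms of the weighted component densities and subtracting their maximum before exponentiation keeps the loss finite even for targets far from all component means, where a direct evaluation of the mixture density would underflow to zero.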
As explained, RNN-MDN variants are commonly an adaptation of the model introduced by Graves [Gra13a]. Especially in the context of sequence prediction, including path prediction, these models are preferred to other neural networks which can also generate a probabilistic distribution over the outputs. Popular alternatives mostly rely on Bayesian neural networks [Mac92, Bis95] or their recurrent extension [For17]. Instead of placing a distribution over the output of the model, Bayesian neural networks use probabilistic neurons. Similar to the models discussed in section 3.1, Bayesian inference has to be applied during training in order to determine the posterior distribution of the network parameters. Due to the intractable probability distributions, either Monte Carlo methods or approximate inference methods, such as variational inference [Blu15], inference based on expectation propagation [Her15], and Monte Carlo dropout [Gal16], have to be applied. The requirement of approximate inference makes training computationally more intensive and potentially less stable. Hence, despite their ability to allow probabilistic predictions and provide information about model uncertainty, these models are less widely used in practice [Hug19]. Compared to Bayesian neural networks or Bayesian RNNs, RNN-MDNs have a simple structure and are thus easier to train and control. Hence, RNN-MDNs have been successfully applied not only to model handwriting data [Gra13a], but also to model sketch drawings [Ha18], to synthesize speech [Wan17], and, more importantly in the context of this thesis, to generate trajectory predictions [Vem18, Ala16, Zha19, Xue19].


Due to the capabilities of RNNs to model arbitrary functions and the success of the RNN-MDN in a variety of sequence processing tasks, we use the RNN-MDN as the basic architecture for our pattern-based solutions to model object trajectories and additionally capture the predictive distribution.


4.2 RNN-based Solutions 


In this chapter, the ability of RNN-based solutions to deal with maneuvering objects is analyzed. We distinguish between the two types of maneuvers which are normally tackled with multiple-model approaches: the switch in noise levels and the switch in motion behavior. They are considered separately. The analysis is done for the two selected tasks of path and intention prediction. For both tasks, higher-level processing strongly relies on the state estimation performance. The approaches are realized as a top-down component of a vision-based tracking system.

Due to cross-disciplinary interest, as shown in section 2.3, there exists a fast-growing number of different approaches. The requirements on the performance depend on the application domain and the particular use cases within it. Due to the wide variety of applications, the different levels of contextual cues, and the number of diverse existing methods, standardized benchmarking is difficult to achieve. Here, path prediction is used for the comparison to related approaches for motion prediction, because there exists a public standard benchmark and the corresponding data includes variation in the noise levels. Intention prediction is selected for capturing scenarios within the application domain of intelligent vehicles to evaluate the ability of the proposed solutions with respect to the switching dynamics of objects. For specific use cases, such as pedestrian crossing, there are not only strict requirements on the prediction horizons, but there are also suitable solutions for physically modeling the pedestrian motion. As discussed in section 2.3, the essential difference is the short time window for prediction, and thus the more dominant role of physics-based multiple-model approaches.


4.2.1 Path Prediction 


Although the integration of more contextual cues can be crucial to improve motion prediction and pattern-based methods can theoretically capture all contextual cues present in the training data, we only rely on the information provided by an underlying object tracker. For path prediction, this is a sequence of past positions from which future positions are inferred. There are two reasons for this restriction. Firstly, the physics-based traditional alternatives, which we aim to replace, also solely rely on the observed object states. Secondly, the incorporation of contextual cues not only leads to very complex algorithms for physics-based methods, but also complicates the design and training of the network, e.g., due to the increased dimension of the input data.

However, even without additional cues, there are many pitfalls when using neural network-based alternatives for path prediction. In the following, our analysis reveals failure cases and gives explanations for observed phenomena. Further, we provide recommendations for overcoming shortcomings, which enabled our proposed solution to achieve the top rank at the public TrajNet 2018 challenge. Our aim is to achieve a more reliable prediction network. Despite the simple core architecture, the network can achieve a performance comparable to more elaborate models that consider more cues than solely position information.


Achieving the objective of finding an effective prediction network involves, 
on the one hand, evaluation of different deep neural networks and, on the 
other hand, an analysis of the properties of the dataset. The dataset analysis 
directly motivates the required modification to enable a robust prediction. 


4.2.1.1 Dataset Analysis 


Although many datasets for path prediction are publicly available, the TrajNet [Sad18] benchmark and the corresponding challenge are the first attempt to build a standard benchmark for path prediction and provide a platform for comparison. The challenge considered here is the world plane human-human TrajNet challenge (World H-H TrajNet).

Figure 4.5: Example trajectories from the BIWI ETH dataset and example tracklets from the sequence Hyang 07 from the Stanford Drone Dataset (SDD).

The TrajNet dataset is a superset of commonly used surveillance datasets which cover real-world scenarios with varying crowd densities and varying complexity of trajectory patterns. In most datasets, the scene is observed from a bird's eye view, but there are also scenarios where the scene is observed under an oblique viewing angle. Details of the datasets are summarized in table 4.1 (adapted from the TrajNet website). The selection includes the following datasets. The BIWI dataset [Pel09], also referred to as ETH Walking Pedestrians, is split into two sets (ETH and Hotel). The UCY dataset, also referred to as Crowds-by-Example dataset [Ler07], contains three scenes from an oblique view, where the first (Zara) shows a part of a shopping street, the second (Students/Uni Examples) captures a part of the university campus, and the third scene (Arxiepiskopi) captures a different part of the campus. The Stanford Drone Dataset (SDD) [Rob16] consists of multiple aerial images capturing different locations around the Stanford campus. Furthermore, the PETS 2009 dataset [Fer09] contains different outdoor activities of crowds observed by multiple static cameras. Sample images with full trajectories and tracklets are shown in figure 4.5. The term tracklet refers to a part of a longer trajectory.


Table 4.1: Training and test datasets of the world plane human-human dataset challenge (adapted from the TrajNet website [Sad18]).

Split     Name                          Resolution  #Pedestrians  Frame rate / fps  Reference
Training  BIWI Hotel                    720 x 576   389           2.5               [Pel09]
Training  UCY Zara                      720 x 576   204           2.5               [Ler07]
Training  UCY Students                  720 x 576   415           25                [Ler07]
Training  UCY Arxiepiskopi              720 x 576   24            2.5               [Ler07]
Training  PETS 2009                     768 x 576   19            2.5               [Fer09]
Training  Stanford Drone Dataset (SDD)  595 x 326   3295          25                [Rob16]
Test      BIWI ETH                      640 x 480   360           2.5               [Pel09]
Test      UCY Zara                      720 x 576   148           25                [Ler07]
Test      UCY Uni Examples              720 x 576   118           25                [Ler07]
Test      Stanford Drone Dataset (SDD)  595 x 326   3297          25                [Rob16]


^ TrajNet website: (http://trajnet.stanford.edu/, last accessed 19.12.2019) 

It should be noted that for the World H-H TrajNet challenge, path prediction is
performed on ground level in a world coordinate system, with cues from the
dynamic environment available in the form of observed trajectories of other
dynamic objects in the scene (human-human). Hence, contextual cues are
implicitly used to realize a mapping from the image space to a 3D reference
system. Since the scenarios capture static surveillance scenes, this mapping is
realized under a flat world assumption and by estimating an individual
homography for every scene. In accordance with our requirements, path
prediction for the challenge relies on position data provided by an underlying
system. In particular, the ground truth trajectories are generated by a visual
tracker or are manually annotated. It is common and good practice to apply
cross-validation. For the TrajNet challenge, this is done by omitting complete
datasets for testing. This is reasonable, given that the interaction behavior
of humans in open spaces is scene-independent, and it allows the generalization
capabilities of various approaches to be measured across datasets.


Figure 4.6: (Left) Visualization of all tracklets of the training set from the
TrajNet dataset collection. (Right) Visualization of all initialization
tracklets of the test set. Both panels show y in meters over x in meters.


Nevertheless, by combining all training sets, the spatial context of
scene-specific motion and the reference systems are lost. When only relying on
observed motion trajectories, positional information is crucial in order to
learn spatio-temporal variation. For example, the sidewalks in the Hyang
sequences (see figure 4.5) lead to a spatially dependent change in the
curvature of a trajectory. Since our focus is on deep neural networks including
RNNs, shifting from position information to offsets helps to overcome some
drawbacks. Before RNNs were successfully applied for tracking pedestrians in a
surveillance scenario, they gained attention due to their success in tasks like
speech recognition [Gra13c, Chu15] and caption generation [Don17, Xu15]. These
domains differ from trajectory prediction in that position-dependent movement
plays no role in them. Accordingly, RNNs can benefit from conditioning on
offsets, instead of absolute positions, for scene-independent motion
prediction. This insight is not new, yet utilizing offsets helps not only to
stabilize the learning process but also to improve the prediction performance
of the evaluated networks. This shift to offsets or rather velocities has also
been applied successfully, for example, to the prediction of human poses based
on RNNs [Mar17]. In the context of deep networks, the same effect can also be
achieved by adding residual connections, which have been shown to improve the
performance of deep convolutional networks [He16]. For the TrajNet challenge,
predicting the following offsets (where will the person go next) instead of the
next position (where will the person be next) [Hug17, Hug18] also contributed
to increased prediction accuracy, presumably because it limits the input and
output spaces. This becomes immediately apparent when looking at the complete
tracklets of the training and test set (see figure 4.6).
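

The shift from positions to offsets can be illustrated with a minimal sketch.
The function names and the toy tracklets below are assumptions made for this
illustration and are not taken from the challenge toolkit. Two tracklets with
identical motion but different scene coordinates yield the same offsets, and
path integration from a start position recovers the original positions.

```python
def positions_to_offsets(positions):
    """Turn K positions [(x, y), ...] into K-1 offsets [(dx, dy), ...]."""
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(positions, positions[1:])]


def integrate_offsets(start, offsets):
    """Path integration: accumulate step-wise offsets from a start position."""
    path = []
    x, y = start
    for dx, dy in offsets:
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


# Hypothetical tracklets: same motion, different location in the world frame.
track_a = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.25), (1.5, 0.5)]
track_b = [(x + 40.0, y - 12.0) for x, y in track_a]

offsets_a = positions_to_offsets(track_a)
offsets_b = positions_to_offsets(track_b)

# The offsets no longer depend on where the tracklet lies in the scene.
print(offsets_a == offsets_b)
```

In this representation, the absolute location only enters through the start
position passed to the integration step.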


Firstly, it takes a considerably higher modeling effort to represent all
possible positions instead of modeling particular velocities. Further, input
data outside the training range can lead to undefined states in the deep
network, which results in an unreasonable output. Some of the initialization
tracklets clearly lie outside the training input space. Also, approaches which
benefit from human-human interaction, such as [Gup18, Has18, Ale17, Ala16], in
combination with deep networks lack information about surrounding persons to
interact with at this point, so that the decoding of relative distances is not
possible because of a reduced person density. Note that the ability of
RNN-based solutions to capture environment cues from position data only is per
se a positive ability, but they require a sufficient amount of training data
without too large gaps in the input data. This is further discussed in the
subsequent sections.


Figure 4.7: (Top Left, Top Right) Offset histograms of the training set.
(Bottom) Magnitude histogram of the offsets.


Another factor for improving prediction performance becomes apparent when
considering the offset distribution of the data. Figure 4.7 shows the offset
histograms for x and y separately. Due to the loss of the reference system, it
is impossible to assume a reasonable location distribution prior. In contrast,
the offset and magnitude distributions clearly reflect the preferred walking
speeds in the data. The histograms also show that a large number of persons are
standing. In the recent work of Hasan et al. [Has18], it was emphasized that
forecasting errors are in general higher when the speed of persons is lower,
and it was argued that when persons are walking slowly, their behavior becomes
less predictable due to physical reasons (less inertia). During our testing, we
observed the same phenomenon. In particular, RNN-based networks tend to
overestimate slow velocities and sometimes do not accurately identify the
standing behavior.


Despite this problem, the range of offsets is very limited compared to the
location distribution and shows a clear tendency towards expected prior values.
Common techniques for sequence prediction problems are normalization and
standardization of the input data. While normalization plays a similar role for
position data, applying standardization to position input data shows no
benefit. In our experiments, standardization worked slightly better than
normalization or an embedding layer for input encoding. Although the effect on
the performance is quite low for the TrajNet challenge, our best result is
achieved using standardized offsets as input. It is rarely strictly necessary
to standardize the inputs, but there are practical reasons like accelerating
the training or reducing the chances of getting stuck in local optima [Bro17].
Predicting offsets also ensures that the output conforms better with the range
of common activation functions. Through standardization of the offsets, the
network uses the deviations from the preferred pedestrian walking behavior to
predict changes in their behavior.
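

As a rough illustration of the standardization step, the following sketch
estimates mean and standard deviation per axis on training offsets and applies
them to the inputs. The helper names and the sample values are hypothetical,
and the sketch assumes that neither axis has zero variance.

```python
from statistics import mean, pstdev


def fit_standardizer(offsets):
    """Estimate per-axis mean and standard deviation from training offsets."""
    dxs = [dx for dx, _ in offsets]
    dys = [dy for _, dy in offsets]
    mu = (mean(dxs), mean(dys))
    sigma = (pstdev(dxs), pstdev(dys))  # assumed to be non-zero
    return mu, sigma


def standardize(offsets, mu, sigma):
    """Express each offset as a deviation from the mean in units of sigma."""
    return [((dx - mu[0]) / sigma[0], (dy - mu[1]) / sigma[1])
            for dx, dy in offsets]


def destandardize(scaled, mu, sigma):
    """Map standardized network outputs back to offsets in meters."""
    return [(zx * sigma[0] + mu[0], zy * sigma[1] + mu[1])
            for zx, zy in scaled]


# Hypothetical offsets in meters per step, including one nearly standing sample.
train_offsets = [(0.50, 0.02), (0.52, -0.01), (0.48, 0.00),
                 (0.02, 0.01), (0.55, -0.03)]
mu, sigma = fit_standardizer(train_offsets)
scaled = standardize(train_offsets, mu, sigma)
restored = destandardize(scaled, mu, sigma)
```

The statistics are estimated on the training data only and reused unchanged at
test time, so that a standardized value of zero corresponds to the average
training step.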


Without discretization artifacts, the dynamics of humans are smooth and
persistent. The trajectory data from the TrajNet dataset include varying
discretization artifacts or noise levels resulting from the different methods
with which the ground truth data was generated. As explained, part of the
ground truth trajectories are generated by a visual tracker. To approximate the
amount of noise in the datasets, a smoothed spline fit through the complete
tracklet is compared to the provided ground truth tracklet points. The spline
fitting is done with a polynomial function of varying degree (1, 2, 3, 4),
independently for the x and y values. By selecting the individual best fit with
regard to the mean squared error for a single trajectory fit, over- and
under-fitting is prevented. The remaining error is used to approximate the
noise for single trajectories. Although using a fitted trajectory as
pre-processed ground truth can also induce some errors, the fitted trajectories
capture the continuous, persistent motion better. Further, the complete history
or rather full trajectories are considered instead of only short time windows,
which makes the fit more robust. Overall, the fitted trajectories form a smooth
and natural path and are used as a rough assessment of the noise levels in the
ground truth trajectory data. The results for the training set are summarized
in table 4.2.
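

A minimal sketch of such a noise estimate is given below. It fits polynomials
of degree 1 to 4 independently for x and y with NumPy and reports the standard
deviation of the residuals. The selection criterion is an assumption of this
sketch: since the plain in-sample squared error never increases with the
polynomial degree, the error is divided by the remaining degrees of freedom
here, which need not match the criterion actually used. The function name and
the synthetic tracklet are hypothetical.

```python
import numpy as np


def fit_smooth_tracklet(xy, degrees=(1, 2, 3, 4)):
    """Fit one polynomial per axis; return the smoothed path and residual std.

    xy: array-like of shape (K, 2) with K >= 3 positions in meters.
    """
    xy = np.asarray(xy, dtype=float)
    k = len(xy)
    t = np.linspace(-1.0, 1.0, k)  # normalized time keeps the fit well-conditioned
    smoothed = np.empty_like(xy)
    for axis in range(2):
        best_score, best_fit = np.inf, None
        for deg in degrees:
            if k <= deg + 1:
                continue  # too few points for this degree
            coeffs = np.polyfit(t, xy[:, axis], deg)
            fit = np.polyval(coeffs, t)
            sse = np.sum((xy[:, axis] - fit) ** 2)
            score = sse / (k - deg - 1)  # assumed criterion: adjusted squared error
            if score < best_score:
                best_score, best_fit = score, fit
        smoothed[:, axis] = best_fit
    residual_std = np.std(xy - smoothed, axis=0)  # noise estimate for x and y
    return smoothed, residual_std


# Synthetic example: straight walking with additive position noise of 5 cm.
rng = np.random.default_rng(0)
steps = np.arange(20)
clean = np.stack([0.5 * steps, 0.1 * steps], axis=1)
noisy = clean + rng.normal(0.0, 0.05, clean.shape)

smoothed, noise_std = fit_smooth_tracklet(noisy)
```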


The approximated noise levels show the variation in the ground truth data.
In order to outperform a linear baseline predictor, the learned model must be
able to successfully model different velocity profiles and capture curved paths
from input data with different noise levels. Initial experiments that trained
solely on smoothed fitted trajectories with synthetic noise performed worse.
Nevertheless, for the prediction of future steps, the best performing predictor
is trained to forecast smoothed paths. Before the different evaluated models
are introduced, the last data analysis of the training set is intended to
assess the complexity in terms of the non-linearity of the trajectories. To
this end, the coefficient of determination $R^2$ for linear interpolation is
calculated separately for the x and y values. This linear interpolation serves
as a baseline predictor for the TrajNet challenge. The histograms of $R^2$ for
the training set are shown in figure 4.8. $R^2$ is the proportion of the
variation that is explained by the model and is used to determine the
suitability of the regression fit as a linearity measure [Dra66]. The average
$R^2$ values are summarized in table 4.2. It can be seen that for most
tracklets, a linear interpolation works well. In order to outperform the linear
interpolation baseline, it is crucial not only to cover a variety of complex
observed motions but also to produce robust results in simpler situations. As
shown, the person's velocity has to be effectively captured by the model.
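

The linearity measure can be sketched as follows, assuming that the linear
model is a least-squares line over the time index, fitted separately per
coordinate. The function name is hypothetical, and returning 1.0 for a constant
coordinate is a convention chosen for this sketch only.

```python
import numpy as np


def r2_linear(values):
    """Coefficient of determination of a least-squares line over the time index."""
    values = np.asarray(values, dtype=float)
    t = np.arange(len(values), dtype=float)
    ss_tot = np.sum((values - values.mean()) ** 2)
    if ss_tot == 0.0:
        return 1.0  # constant coordinate: convention assumed for this sketch
    slope, intercept = np.polyfit(t, values, 1)
    ss_res = np.sum((values - (slope * t + intercept)) ** 2)
    return 1.0 - ss_res / ss_tot


t = np.arange(20.0)
r2_straight = r2_linear(0.4 * t + 1.0)  # straight walking along one axis
r2_curved = r2_linear(0.02 * t ** 2)    # steadily accelerating coordinate
```

A value close to one indicates that a line explains the coordinate well, while
curved or accelerating motion lowers the score.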


The analysis of the TrajNet dataset shows that the main challenges to achieve
a good, robust prediction performance for the World H-H TrajNet challenge
are the following. Firstly, the generalization ability across datasets.
Secondly, the ability to deal with varying noise levels, since the data mainly
include straight walking persons observed with different noise levels.
Nevertheless, the scenarios also include behavior changes resulting, inter
alia, from human-human interaction. But for unaware motion modeling, a change
in behavior has to be estimated from the corresponding individual tracklet
alone.


Table 4.2: Standard deviation of the distance between a smoothed spline fit and
the ground truth trajectory data, and the average $R^2$ score for all tracklets
in the subsets.


Name                  $\sigma_{x,spline}$ / m  $\sigma_{y,spline}$ / m  $R^2_x$  $R^2_y$
Overall               0.067                    0.069                    0.889    0.811
BIWI Hotel            0.042                    0.031                    0.637    0.876
UCY Zara_02           0.029                    0.035                    0.952    0.758
UCY Zara_03           0.026                    0.031                    0.935    0.716
UCY Students_01       0.033                    0.029                    0.868    0.852
UCY Students_03       0.039                    0.040                    0.915    0.760
UCY Arxiepiskopi_01   0.050                    0.027                    0.959    0.677
PETS 2009 S2L1        0.037                    0.026                    0.781    0.877
SDD Bookstore 00      0.060                    0.063                    0.889    0.844
SDD Bookstore 01      0.054                    0.053                    0.879    0.878
SDD Bookstore 02      0.068                    0.073                    0.861    0.921
SDD Bookstore 03      0.069                    0.061                    0.951    0.830
SDD Coupa 03          0.057                    0.043                    0.954    0.937
SDD Deathcircle 00    0.072                    0.079                    0.893    0.808
SDD Deathcircle 01    0.086                    0.103                    0.850    0.818
SDD Deathcircle 02    0.151                    0.158                    0.772    0.591
SDD Deathcircle 03    0.116                    0.134                    0.816    0.770
SDD Gates 00          0.054                    0.073                    0.980    0.735
SDD Gates 01          0.064                    0.084                    0.859    0.890
SDD Gates 03          0.086                    0.106                    0.847    0.860
SDD Gates 04          0.071                    0.155                    0.820    0.906
SDD Hyang_04          0.048                    0.050                    0.829    0.842
SDD Hyang_05          0.059                    0.081                    0.872    0.740
SDD Hyang_06          0.070                    0.066                    0.875    0.811
SDD Nexus 00          0.076                    0.082                    0.886    0.742
SDD Nexus 02          0.069                    0.074                    0.934    0.726
SDD Nexus 07          0.053                    0.069                    0.935    0.764


Figure 4.8: Coefficient of determination $R^2$ for x and y for all training
tracklets of the World H-H TrajNet challenge.


4.2.1.2 Models and Evaluation 


Finding an effective prediction network is done by using a coarse-to-fine 
searching strategy to reach the maximum achievable prediction accuracy 
without further cues like human-human interaction or human-space inter- 
action based on basic networks. Towards this end, we started with a set 
of networks with a limited set of hyper-parameters to narrow it down to 
one network, in order to then extend the hyper-parameter set for a more 
exhaustive tuning. 


For the World H-H TrajNet challenge, the performance is compared using the
two error metrics average displacement error (ADE) and final displacement
error (FDE). These metrics are commonly used to assess path prediction
performance (see for example [Ala16, Vem18, Pel09, Gup18, Xue18, Has18]).
The average of both values is then used as an overall average to rank the
approaches. The ADE is defined as the average L2 distance between the ground
truth and the prediction over all predicted time steps, and the FDE is defined
as the L2 distance between the predicted final position and the true final
position. For the World H-H TrajNet challenge, the error metrics are given in
meters. For all experiments, 8 consecutive positions (3.2 seconds) are
observed before predicting the next 12 positions (4.8 seconds). Since the
maximum-likelihood path is used for evaluation, the networks are initially
realized as regressors by using the squared loss as a distance function for the
path being predicted.
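

The two metrics and their combination can be written down directly from the
definitions above. The following sketch uses hypothetical function names and a
toy prediction whose lateral error grows by 0.1 m per step. It is not the
official benchmark toolkit.

```python
import math


def ade(pred, truth):
    """Average L2 distance between prediction and ground truth over all steps."""
    return sum(math.hypot(p[0] - g[0], p[1] - g[1])
               for p, g in zip(pred, truth)) / len(truth)


def fde(pred, truth):
    """L2 distance between the predicted and the true final position."""
    return math.hypot(pred[-1][0] - truth[-1][0], pred[-1][1] - truth[-1][1])


def overall_average(pred, truth):
    """Mean of ADE and FDE, used here as the ranking score."""
    return 0.5 * (ade(pred, truth) + fde(pred, truth))


# 12 predicted steps; the prediction drifts sideways by 0.1 m per step.
truth = [(0.5 * k, 0.0) for k in range(1, 13)]
pred = [(0.5 * k, 0.1 * k) for k in range(1, 13)]

print(ade(pred, truth), fde(pred, truth), overall_average(pred, truth))
```

Because the FDE only looks at the last step, a drift that grows over time
affects it more strongly than the ADE.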


For a fixed observation window, as used in the World H-H TrajNet challenge,
MLP-based networks can also be used to realize a prediction network, so they
are also considered in our coarse evaluation. Moreover, by adapting the
architecture or by applying a fixed time window solution in a sliding-window
fashion, the network can be used for longer input lengths. In contrast to other
sequence prediction tasks such as natural language processing, the relevant
time horizon for path prediction is shorter. Besides the described architecture
from section 4.1, temporal convolutional networks (TCNs) are increasingly used
to encode the observations for fixed time horizons. Due to their less complex
structure, they are easier to train and to control [Bai18, Mil19]. Recent
results indicate that TCNs can compete with RNNs in sequence prediction tasks
such as audio synthesis and language processing.


The following basic neural networks and corresponding variants are selected 
for a coarse evaluation in addition to approaches from the community and 
approaches provided by the organizers of the TrajNet challenge. 


MLP: The MLP is tested with different linear and non-linear activation
functions. One variation concatenates all inputs and predicts 24 outputs
directly. Further, cascaded architectures with a step-wise prediction are
examined. We vary between Euclidean and polar coordinate systems. As discussed
in section 4.2.1.1, positions and offsets (also orientation normalized) are
considered as inputs and outputs.


RNN-MLP: Vanilla RNNs produce an output at each time step. For the evaluation
of the RNN-MLP, we vary only the MLP which is used for the decoding of the
positions and offsets.


RNN-encoder-MLP: In contrast to the RNN-MLP network, the complete ini- 
tialization tracklet is used to generate the internal representation before a 
prediction is done. The RNN-encoder-MLP is varied by alternating activation 
functions for the MLP and by alternatively predicting the complete future 
path/offsets instead of only next steps. As a further alternative, the full path 
is predicted as offsets to one reference point instead of applying path integra- 
tion in order to predict the final position. 


RNN-encoder-decoder-model (Seq2Seq): In addition to the components of the
RNN-encoder-MLP, a Seq2Seq model includes a second network. This second decoder
network takes the internal representation of the encoder and then starts
predicting the next steps. For the evaluation of this model, the settings
differ in the activation functions of the MLP on top of the decoder RNN.


Temporal convolutional networks (TCN): As an alternative to RNNs and 
based on WaveNets [van16], Bai et al. [Bai18] introduced a general convolu- 
tion architecture for sequence prediction. We tested their standard and ex- 
tended architecture with a gating mechanism (GTCN). For a more detailed 
description, we refer to the original papers. 


All networks were trained with varying numbers of layers (1 to 5) and hidden
units (4 to 64). The models are trained for 100 epochs using the ADAM
optimizer [Kin15], a variant of stochastic gradient descent, with a fixed
learning rate of 0.005 and have been implemented in TensorFlow [Aba15].
Initially, only standard RNN cells are used for the experiments. Later, we also
tested the RNN variants LSTM [Hoc97] and GRU [Cho14] (see section 4.1). As
loss, the mean squared error between the predicted and the ground truth
positions or offsets over all time steps is used.


In order to emphasize trends, an excerpt of the experiments’ results is summa- 
rized in table 4.3. The best results were achieved with the RNN-encoder-MLP. 
However, in most cases the different architectures perform very similarly. 
These initial results also show that the best performing networks lie close to
the result achieved with linear interpolation. Since the previous dataset
analysis revealed that for a large number of tracklets linear interpolation works
quite well, it is a crucial requirement to produce stable results also in simple 
situations. Thus, factors such as strong overestimation of slow person veloc- 
ities and some undefined random predictions when using positions can lead 
to weak performances compared to simple baselines. For reducing the effect 
of overestimation of slowly walking persons, Hasan et al. [Has18] integrated 
head pose information. However, this information requires a suitable head 
pose detector and this additional cue is not available for the TrajNet chal- 
lenge. We can only remark for the tested networks that this effect can also 
differ for different runs. Naturally, it is important that during training, the 
networks see enough samples of standing and slow-moving situations. In- 
stead of excluding such samples through heuristics or probabilistic filtering, 
which can help during application, we counteract this by data augmentation 
(see next section). 


Table 4.3: Results from our coarse evaluation on the data corresponding to the
world plane human-human dataset (World H-H TrajNet challenge). In contrast to
the results in table 4.4, the shown results are not generated with the official
benchmark toolkit. Although the same datasets are used for training and
testing, the exact test set selection of ground truth trajectories or tracklets
for the challenge is not publicly available, and thus the results may vary.


Approach                     Overall Average  FDE / m  ADE / m
Linear interpolation         0.894            1.359    0.429
Linear MLP (Pos)             1.041            1.592    0.491
Linear MLP (Off)             0.896            1.384    0.407
Non-linear MLP (Off)         2.103            3.181    1.024
Linear RNN                   0.951            1.482    0.420
Non-linear RNN               0.841            1.300    0.381
Linear RNN-encoder-MLP       0.892            1.381    0.404
Non-linear RNN-encoder-MLP   0.827            1.276    0.377
Linear Seq2Seq               0.923            1.429    0.418
Non-linear Seq2Seq           0.860            1.331    0.390
TCN                          0.841            1.301    0.381
Gated TCN                    0.947            1.468    0.426


These results show that using deep neural networks for path prediction can
produce undesired unstable results. Further, there is no clear-cut best
performing model and no clear guidance towards a class of models. The gap
between an MLP predictor and a Seq2Seq model is very narrow in the test
scenarios. However, besides the factors derived from the data analysis, a
prediction of the full path instead of a step-wise prediction helps to overcome
an accumulation of errors that are fed back into the networks. For the TrajNet
challenge with a fixed prediction horizon, we thus prefer the RNN-encoder-MLP
over a Seq2Seq model. In the domain of human pose prediction based on RNNs,
Zhou et al. [Zho18] reduced this problem with an auto-conditioned RNN, and
Martinez et al. [Mar17] proposed using a Seq2Seq model along with a
sampling-based loss. In [Zim12], Zimmermann et al. reported that extending the
prediction into the future helps to balance the information flow and to achieve
a better input-output relationship. They called the prediction of more future
steps from the encoded representation overshooting. In accordance with the
results presented by [Mil19] for other sequence prediction tasks, TCNs also
perform very similarly to RNNs for path prediction. At about the same time as
our results on path prediction were published [Bec18c, Bec18b], Nikhil and
Morris [Nik18] likewise reported that TCNs yield competitive results for
pedestrian path prediction.


Despite several benefits of TCNs over RNNs and their variants, we stick with an
RNN-based solution due to its connection to Bayesian filtering. Furthermore,
RNNs are more commonly used to represent single motion as part of architectures
which model interactions [Ala16, Ale17, Has18, Xue18] (see section 2.3). Based
on the results of the initial evaluation and the discussed reasons, we chose an
RNN-encoder-MLP as our favored model for a further ablation study in order to
achieve more robust performance.


4.2.1.3 RNN-based prediction network: RED Predictor


For the comparison to existing approaches, the selected model is an
RNN-encoder-MLP. This section summarizes the final design choices which led to
the submitted predictor, which achieved the top rank at the World H-H TrajNet
challenge.


The RNN-encoder can generalize to deal with varying noisy inputs and is thus
able to better capture the person's motion compared to the linear interpolation
baseline. The main insight is that motion continuity is easier to express in
offsets or velocities because it takes considerably more modeling effort to
represent all possible conditioning positions. Especially for the World H-H
TrajNet challenge, with different ranges for positions in the training and test
set, this has a significant influence on whether a good performance can be
obtained. Instead of using the given input sequence
$\mathcal{Z} = \{(x^k, y^k) \in \mathbb{R}^2 \mid k = 1, \ldots, K_{obs}\}$
of $K_{obs}$ consecutive pedestrian positions along a trajectory, here the
offsets
$\mathcal{Z} = \{(\Delta x^k, \Delta y^k) \in \mathbb{R}^2 \mid k = 2, \ldots, K_{obs}\}$
are used for conditioning the network. Apart from the smaller modeling effort
to represent conditioned offsets and the prevention of undefined states due to
a suitable data range, this domain shift makes data pre-processing like the
used standardization more reasonable. In contrast to the position distribution,
the offset or rather velocity distribution approximately follows a normal
distribution around the expected walking speeds of pedestrians. Through
standardization of the offsets, the expected behavior is straight walking, and
thus the network uses the deviations from the dominant walking pattern as
inputs.


In our work [Hug17], we demonstrated that RNN-based solutions can capture
environment cues from position data only. Consequently, they incorporate
scene-specific knowledge, which in turn can be a hindrance for generalizing
across scenes. This actually positive ability relies on additional contextual
cues to enable a better transfer to other scenarios. Thus, by using trajectory
samples from other datasets, an undesired scene prior is included. Naturally,
this effect gets stronger if the scene includes a clear spatially dependent
behavior, such as at the roundabouts and crossings which are present in the
Stanford Drone Dataset (SDD). For short time horizons and to better generalize
across different and unseen environments, switching to relative positions (all
training tracklets start at the origin) and using offsets should be preferred.
The spatial information then only persists in an implicit fashion through path
integration. As discussed in section 2.3, for longer time horizons the motion
of an object, here a pedestrian, is more strongly motivated by its goals.
Classifying the goals of pedestrians in a scene requires further scene
knowledge. The above input modifications help in training neural networks by
scaling the inputs to a reasonable range, although in theory the desired
scaling can be achieved with appropriate weights and biases alone. With
non-linear activation functions such as sigmoid-type functions, it is
impossible to reproduce an ever increasing trend. A network can achieve output
values greater than the bound of a single neuron, but it can still saturate at
minimum or maximum values, in particular for trending input data. Straight
walking can be interpreted as trended position data that increases along a
trajectory. The last observation can be used to provide a direct connection to
the output layers, which is realized as a linear MLP. This non-squashing
connection counters the saturation problem. Due to careful scaling around the
known preferred walking behavior inside reasonable bounds, the network can
better handle the trend in trajectory data dominated by straight walking. Since
the pedestrian motions are not raw trended sequential data and due to careful
pre-processing, a direct connection for reducing the saturation effect is not
mandatory, but it should be kept in mind when using position data only.


In order to deal with discretization artifacts in the ground truth trajectories
and to make further training easier, smoothed trajectories are used as the
desired output. As described, the minimal spline fit with a polynomial function
of varying degree for a complete tracklet is used to achieve smoother and more
persistent dynamics. RNNs can generalize over the outputs and produce smooth
predictions in any case; the intention is not to have the network synthesize
the noise but to make training easier by reducing the artifacts in the ground
truth data. As a drawback, the fit can produce incorrect results in some cases,
but overall the trajectories look more natural and smooth. Especially if longer
tracklets or rather complete trajectories can be considered, the fitting
results improve due to the incorporation of more data points. In case of a poor
spline fit, corresponding to a large fitting error for all degrees of the
polynomial function, the original trajectory is kept. Examples of fitted ground
truth tracklets are depicted in figure 4.9.


Figure 4.9: Example visualization of pre-processed ground truth trajectories to
produce a more persistent motion behavior and reduce ground truth
discretization artifacts.


In order to reduce the effect of error accumulation during a step-wise pre- 
diction, overshooting is applied [Zim12]. Instead of feeding back RNN outputs 
step-wise, the encoded representation is used to apply a multi-step prediction. 
For the TrajNet challenge, the whole future path is predicted. Full path
integration works similarly well, but here offsets to the reference position
(the last observed position) are predicted.


Physics-based models, such as a CV model, are independent of the absolute 
values of states despite not being physically plausible. In spite of the loss of 
physical interpretability and related problems, such models can be applied as
general translational models without modification. For example, this applies 
to tracking an object in image space as discussed in section 3.2. 


In contrast, deep learning-based models can over-fit to limited training
samples and are not able to generalize to input data which is too far outside
the seen input range. For example, in the BIWI ETH dataset or BIWI Hotel
dataset, the trajectories include a strong bias along a specific direction.
Thus, training on BIWI ETH trajectories and inferring on rotated trajectories
leads to an underestimation of the true velocity along the dominating direction
in the training samples. This effect can result from over-fitting to data bias
or scene prior or from a deviation from the input range, or it can be seen as a
lack of generalization ability. In addition to the factors explained above,
this potential shortcoming can be counteracted by data augmentation. Since
there exist several reasonable physics-based models to describe pedestrian
behavior, it is possible to augment the training data by simulating realistic
motion profiles. Although the scenarios of the TrajNet challenge are relatively
well suited to improve training by simulation, because pedestrian behavior is
evaluated on ground level in a world coordinate system, data augmentation is
only done by reversing all training tracklets of the provided challenge data.
Thereby, the amount of training samples reflecting single object behavior is
directly doubled. For the submitted results, no further data augmentation
techniques are applied.
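

The reversal-based augmentation amounts to adding every training tracklet a
second time in reversed temporal order. A minimal sketch with hypothetical
function names and toy tracklets could look as follows.

```python
def reverse_tracklet(positions):
    """Return the same path walked in the opposite direction."""
    return list(reversed(positions))


def augment_by_reversal(tracklets):
    """Keep all original tracklets and append their reversed versions."""
    return list(tracklets) + [reverse_tracklet(t) for t in tracklets]


# Hypothetical training tracklets in world coordinates (meters).
tracklets = [
    [(0.0, 0.0), (0.5, 0.0), (1.0, 0.25)],
    [(3.0, 1.0), (3.0, 1.5), (3.0, 2.0)],
]
augmented = augment_by_reversal(tracklets)

print(len(tracklets), len(augmented))
```

In terms of offsets, a reversed tracklet contains the negated offsets of the
original in reversed order, so each speed profile is also seen in the opposite
walking direction.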


The discussed effect of over-fitting to a scene is depicted in figure 4.10. 
The images show the prediction of an RNN-encoder with an MDN, which 
parametrizes a bi-variate Gaussian, as last layer for an original tracklet of the 
ETH dataset and a tracklet rotated by 90°. As a loss, a linear combination 
of the negative log-likelihood for the ground truth future positions under 
the predicted positions (see section 4.19) and the mean squared error to the 
ground truth trajectories is used. The models are both trained on a subset of 
the ETH dataset. On the left, an original test tracklet is used to infer future
positions and covariances. On the right, a rotated test tracklet is used. It is clearly
visible that the model on the right produces a bad prediction due to the effects 
described above. 


The proposed simple but effective predictor for the TrajNet challenge combines
all the listed factors. At its core, the architecture is an RNN-encoder with a
dense MLP stacked on top (RED). Hence, the predictor is referred to as RED
predictor when it is realized as a pure regression model. In the demonstration
example from figure 4.10, where the MLP is replaced with an MDN to also capture
the predictive distribution, the term RED-MDN is used.


Figure 4.10: Qualitative results with an RNN-encoder model with and without a
well-adapted input range. On the left and on the right, the same observations
are used, but on the right they are rotated. (Left) The model is able to
represent the motion and make a reasonable prediction. (Right) Due to
unbalanced training samples with trajectories mainly oriented along one
direction, the network produces a poor estimate for a 90° rotated trajectory.


Realized as a regression model and without direct connection to the last ob- 
servation, the RED predictor can be defined by: 


hPa = RNNCh kae AG. y; Oenc) 


k+k k+k K 
yKpre = {(Ax Mea De) + em = MLP(hé nc; ©mzp). 


(4.21) 


Here, RNN(·) is the recurrent network, $h_{enc}$ the hidden state of the
RNN-encoder with corresponding weights $W_{enc}$ and biases $b_{enc}$ (parameters
$\Theta_{enc}$), which is used to generate the full, smoothed path. The term MLP(·)
reflects an MLP including the conforming weights $W_{MLP}$ and biases $b_{MLP}$ to
map the vector $h_{enc}$ to the observation space
($\Theta_{MLP} = \{W_{MLP}, b_{MLP}\}$). The overall
architecture is visualized in figure 4.11.
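
To make the data flow of equation 4.21 concrete, the following minimal sketch encodes the observed offsets recurrently and decodes all future displacements at once. It rests on several assumptions that are not part of the thesis: a plain tanh RNN cell replaces the LSTM, the weights are random and untrained, the MLP has a single ReLU layer of hypothetical size 64, and all names (`red_predict`, `init_params`) are illustrative only. The predicted displacements are interpreted as being relative to the last observed position.

```python
import numpy as np

def init_params(state=32, hidden=64, n_pred=12, seed=0):
    """Random, untrained weights (illustration only)."""
    rng = np.random.default_rng(seed)
    s = 0.1
    return {
        "W_x": s * rng.standard_normal((state, 2)),
        "W_h": s * rng.standard_normal((state, state)),
        "b_enc": np.zeros(state),
        "W_1": s * rng.standard_normal((hidden, state)),
        "b_1": np.zeros(hidden),
        "W_2": s * rng.standard_normal((2 * n_pred, hidden)),
        "b_2": np.zeros(2 * n_pred),
    }

def red_predict(offsets, params, n_pred=12):
    """Encode the offset sequence, then map the final hidden state to
    n_pred two-dimensional displacements in a single MLP pass."""
    h = np.zeros(params["W_h"].shape[0])
    for d in offsets:  # d = (dx, dy) of one time step
        h = np.tanh(params["W_x"] @ d + params["W_h"] @ h + params["b_enc"])
    hidden = np.maximum(0.0, params["W_1"] @ h + params["b_1"])  # ReLU layer
    out = params["W_2"] @ hidden + params["b_2"]
    return out.reshape(n_pred, 2)

# Eight observed positions of a pedestrian walking along the x-axis.
positions = np.cumsum(np.tile([0.4, 0.0], (8, 1)), axis=0)
offsets = np.diff(positions, axis=0)          # seven offsets
pred_offsets = red_predict(offsets, init_params())
# Assumption: displacements are added to the last observed position.
pred_positions = positions[-1] + pred_offsets
```

With trained weights, `pred_positions` would hold the 12 predicted positions; here only the shapes and the conditioning on offsets are meaningful.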


Figure 4.11: Visualization of the RED architecture. The conditioning is done for the full
initialization sequence $Z = \{(\Delta x^{k}, \Delta y^{k}) \in \mathbb{R}^{2} \mid k = 2, \dots, 8\}$.
The internal representation is then used to predict the desired path at once (all 12
positions) using the last observed position $(x^{9}, y^{9})$ as reference for localization.


The submitted result of our final RED predictor, obtained with the official
benchmark toolkit, is highlighted in red in table 4.4. Some qualitative prediction
examples from the RED-MDN on the BIWI ETH and BIWI Hotel datasets are
visualized in figures 4.12 and 4.13 and in figures 4.14 and 4.15, respectively. Although
the dominating behavior of the pedestrians is straight walking, the scenarios 
also include diverse and more complex behaviors. The images depict that the 
network is able to capture the different motion types and to adapt by incorpo- 
rating new observations. Prediction is done for individual pedestrians solely 
based on the observed trajectory. The examples show that, despite not using any
cues from other persons, the RED-MDN predicts similar dynamical behavior
for persons walking closely together in a group. By comparing different
sample situations, it can be seen that the network is able to model different 
walking speeds. It can also be seen how the prediction is adapted when the
dynamics are changing. In figure 4.14, a deceleration and a stopping
behavior are correctly captured. The ability of RNN-based networks to deal with
the maneuver type of changing dynamics is discussed in detail in section 4.2.2.


After a fine search over the settings of the selected network, the result shown in
table 4.4 is produced with an LSTM unit (state size of 32) and one recurrent layer.
The proposed predictor produces competitive results compared to
elaborate models which additionally rely on interaction information, such as
the model from Yamaguchi et al. [Yam11] (an extended social-force-field model
based on the approach of Helbing and Molnár [Hel95]) and the Social-LSTM
[Ala16]. Compared to all submitted approaches of the World H-H TrajNet 2018
challenge, the RED predictor achieved the best result. All other results were
either officially submitted or provided by the organizers.


Table 4.4: Results for the world plane human-human dataset (World H-H TrajNet) challenge,
including our submitted RNN-based approach (RED predictor). The results of
approaches marked with (*) are directly obtained from the corresponding papers.
Other results are taken from the TrajNet website¹ [Sad18].


Approach                        Overall average   FDE/m   ADE/m   Reference
RED predictor                   0.783             1.207   0.359   Ours
SR-LSTM                         0.815             1.229   0.370   [Zha19]
Social Forces (EWAP)            0.819             1.266   0.371   [Yam11]
FISHY                           0.820             1.256   0.375   -
JHU                             0.844             1.304   0.384   -
Predictor SUL                   0.887             1.374   0.399   -
Linear interpolation            0.894             1.359   0.429   -
Social Forces (ATTR)            0.904             1.395   0.412   [Yam11]
LVA*                            0.945             1.449   0.438   [Xue19]
SGAN                            1.334             2.107   0.506   [Gup18]
OSG                             1.385             2.106   0.664   -
Social LSTM                     1.387             2.098   0.675   [Ala16]
LV*                             1.398             2.072   0.723   [Xue19]
Interactive Gaussian Processes  1.642             1.038   2.245   [Ell09]
Vanilla LSTM                    2.107             3.114   1.100   -
Occupancy LSTM                  2.111             3.12    1.101   [Ala16]


¹ TrajNet website: http://trajnet.stanford.edu/ (last accessed 19.12.2019)
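
For readers unfamiliar with the two error measures reported in the table, the following sketch shows the common definitions for a single trajectory: the average displacement error (ADE) is the mean Euclidean distance over all predicted positions, and the final displacement error (FDE) is the distance at the last predicted position. The function name and the toy trajectories are assumptions for illustration; the official benchmark toolkit may aggregate over trajectories differently.

```python
import numpy as np

def displacement_errors(pred, truth):
    """Return (ADE, FDE) in the unit of the inputs for one trajectory."""
    dist = np.linalg.norm(np.asarray(pred) - np.asarray(truth), axis=-1)
    return dist.mean(), dist[-1]

# Toy example: the prediction drifts sideways by 0.3 m per step.
truth = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
pred = np.array([[0.0, 0.0], [1.0, 0.3], [2.0, 0.6]])
ade, fde = displacement_errors(pred, truth)
```

For the toy data, the per-step distances are 0, 0.3 and 0.6 m, so the ADE is their mean and the FDE is the last value.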


Nevertheless, the Social-LSTM is one of the first proposed RNN-based
architectures which includes human-human interaction, and it laid the basis for
architectures such as those presented in the work of Hasan et al. [Has18], Xue et al.
[Xue18], and Zhang et al. [Zha19]. In these models, single motion is modeled with an
RNN or, more precisely, an LSTM network. By applying some of the proposed factors
to these models, it is expected that the models and corresponding extensions are
able to outperform the proposed single motion predictor in datasets with high
pedestrian density.


4.2.1.4 Summary 


Although a part of the presented modifications is tailored to the TrajNet chal- 
lenge, the results show how to enhance prediction performance for RNN- 
based networks. The provided analysis of the datasets reveals some weaknesses
of the challenge, such as the lack of a defined categorization into linear and
non-linear trajectories or into trajectories with and without human interaction.
However, it is difficult to provide standardized benchmarking, in particular due to
the fast-growing body
of different approaches for capturing object dynamics. In terms of our goals, 
the trajectory data reflects the desired properties of observations provided by 
an underlying visual tracking component with varying noise levels. Hence, 
the results show that RNN-based models are able to deal with this maneuver 
type. Due to the emphasized modifications, the proposed network achieves 
state-of-the-art performance compared to existing alternative approaches on 
a public benchmark dedicated to path prediction. It is clear that, independently
of the model complexity, approaches restricted to observing only information
from a single trajectory are close to the performance limit reachable
on the current dataset repository. However, the TrajNet benchmark also
provides human-human and human-space information. Recent work such as the
approaches of Gupta et al. [Gup18] or Xue et al. [Xue18] (human-human) and
Sadeghian et al. [Sad19] (human-human, human-space) shows possibilities of
how to better anticipate pedestrian behavior by integrating these
contextual cues.


4.2.2 Intention Prediction


In this section, the ability of a proposed RNN-based solution with respect to 
switching dynamics of pedestrians is evaluated for scenarios within the ap- 
plication domain of intelligent vehicles. In the domain of intelligent vehicles, 
pedestrian intention is mainly used as part of the overall behavior analysis 
of a vision-based active safety system and applied jointly with path
prediction. Due to the short time window for emergency braking, physics-based
approaches are the preferred solution. We consider concrete scenarios that
are tackled with multiple-model approaches such as the IMM filter. Although 
many approaches can relatively reliably predict the location of objects a few 
seconds ahead, they still struggle to predict when the object will stop [Rid18, 
Has15b]. Hence, the scenario of stopping pedestrians with the corresponding 
physics-based multiple-model solutions is analyzed in particular. 


Inspired by an IMM filter solution, we propose an RNN-based IMM filter sur- 
rogate for improved handling of varying dynamics over time. On the one 
hand, the presented RNN-based model is able to provide a confidence value 
for the performed dynamics. On the other hand, it can overcome some lim- 
itations of the classic IMM filter. The proposed RNN-IMM incorporates the 
insight from section 4.2.1 and is based on an RNN-encoder-decoder network 
introduced by Deo and Trivedi [Deo18] for the case of freeway traffic predic- 
tion. 


In the following, a description of the RNN-IMM is provided before different 
maneuver scenarios including corresponding common physics-based dynam- 
ical models for Kalman and IMM filters are analyzed. 


4.2.2.1 RNN-based IMM Filter Surrogate 


The goal is to devise a model that can successfully predict future paths of 
pedestrians and represent alternating pedestrian dynamics, e.g., dynamics 
that can transition from straight walking to a turning maneuver or to stop- 
ping. Both the IMM filter and the RNN-IMM predict a parametric distribution 
over object states and jointly capture maneuver probabilities for subsequent
processing. 


As in the previous section, trajectory prediction is formally stated as the
problem of predicting the future trajectories of a pedestrian, conditioned on
its track history. Given an input sequence
$Z = \{(x^{k}, y^{k}) \in \mathbb{R}^{2} \mid k = 1, \dots, k_{obs}\}$
of $k_{obs}$ consecutive observed pedestrian positions $\mathbf{z}^{k} = (x^{k}, y^{k})$
at time step $k$ along a trajectory, the task is to filter the current position
$\hat{\mathbf{z}}^{k} = (\hat{x}^{k}, \hat{y}^{k})$ and to generate a multi-modal
prediction for the next $k_{pred}$ positions
$\{\mathbf{z}^{k+1}, \mathbf{z}^{k+2}, \dots, \mathbf{z}^{k+k_{pred}}\}$.




Figure 4.16: Visualization of the RNN-based IMM filter surrogate (RNN-encoder-decoder net- 
work) for jointly predicting specific dynamical probabilities and corresponding dis- 
tributions of future trajectory positions. The encoder predicts the dynamical prob- 
abilities and the filtered position for the current time step. The decoder uses the 
context vector and the position estimate to predict future pedestrian locations. 


As discussed in section 4.2.1, it takes considerably more modeling effort 
to represent all possible conditioning positions. Thus motion continuity is 
easier to express in offsets or velocities. In order to exploit scene-specific 
knowledge for trajectory prediction, additional use of the position informa- 
tion is required. In our work [Hug17], we showed that RNN-based trajectory 
prediction models are able to capture spatially dependent behavior changes 
only from motion data, given sufficient training samples. In order to analyze the
capabilities of the RNN-based model in prototypical maneuver scenarios, mostly
synthetic data is used as an idealized sanity check. For
a fixed reference system, position information is used to estimate the true
position. In addition, the offsets are used for conditioning the network,
$Z = \{(x^{k}, y^{k}, \Delta x^{k}, \Delta y^{k}) \in \mathbb{R}^{4} \mid k = 2, \dots, k_{obs}\}$.
The future trajectory is denoted with
$Y = \{(x^{k}, y^{k}) \in \mathbb{R}^{2} \mid k = k_{obs}+1, \dots, k_{pred}\}$
and the filtered position with $\hat{\mathbf{x}}^{k} = \hat{\mathbf{z}}^{k}$.
Compared to Bayesian filtering, $\hat{\mathbf{x}}^{k}$ is not the full dynamical
state, but the observable state $\hat{\mathbf{z}}^{k}$ after applying the RNN
equivalent of the observation model (see equation 2.9). This expression is used to
highlight the analogy to the IMM filter, but the notion of deterministic inputs for
RNNs is kept. The model estimates the conditional distribution
$p(Y, \hat{\mathbf{x}}^{k} \mid Z)$. In order
to identify specific dynamics under $M$ desired maneuver classes (e.g., turning
maneuvers, stopping, and straight walking), this term can be given by:


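\[
p(Y, \hat{\mathbf{x}}^{k} \mid Z) = \sum_{i=1}^{M} p_{\Theta_{RNN\text{-}IMM}}(Y, \hat{\mathbf{x}}^{k} \mid m_{i}, Z)\, P(m_{i} \mid Z). \tag{4.22}
\]


Here, $\Theta_{RNN\text{-}IMM} = \{\Theta^{k_{obs}+1}_{RNN\text{-}IMM}, \dots, \Theta^{k_{pred}}_{RNN\text{-}IMM}\}$
are the parameters of an $L$-component Gaussian mixture model,
$\Theta^{k}_{RNN\text{-}IMM} = (\mu^{k}_{l}, \Sigma^{k}_{l}, w^{k}_{l})_{l=1,\dots,L}$.
By adding the maneuver context in form of the posterior mode probability
$P(m_{i} \mid Z) = \alpha_{i}$, the analogy to the classic IMM filter becomes apparent.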
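
The structure of equation 4.22, a sum of mode-conditioned densities weighted by the mode probabilities, can be illustrated with a small numerical sketch. It assumes one Gaussian per maneuver class and a single two-dimensional future position; the mode probabilities, means and covariances below are made-up values, not results from the thesis.

```python
import numpy as np

def gaussian_pdf(y, mean, cov):
    """Density of a multivariate Gaussian at point y."""
    diff = y - mean
    norm = np.sqrt((2.0 * np.pi) ** len(y) * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def maneuver_mixture_pdf(y, mode_probs, means, covs):
    """Sum over modes of P(m_i | Z) times the mode-conditioned density."""
    return sum(a * gaussian_pdf(y, m, c)
               for a, m, c in zip(mode_probs, means, covs))

# Hypothetical modes: straight walking (far mean) and stopping (near mean).
mode_probs = np.array([0.8, 0.2])
means = [np.array([0.7, 0.0]), np.array([0.1, 0.0])]
covs = [0.01 * np.eye(2), 0.01 * np.eye(2)]
density = maneuver_mixture_pdf(np.array([0.7, 0.0]), mode_probs, means, covs)
```

Evaluated at the walking mean, the density is dominated by the first mode, since the stopping component is about six standard deviations away.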


For an IMM filter, the mode probability is used to calculate the mixing
probabilities to combine the set of chosen candidate models into a merged estimate
(see equation 3.50). In an IMM filter, the time behavior of the basic filter set is
modeled as a homogeneous (time-invariant) Markov chain with
a fixed transition probability matrix (TPM)
$p_{ij} \triangleq P(m_{j}^{k} \mid m_{i}^{k-1})$. As shown in
section 3.1.2, the posterior density of the IMM filter can be written under the
assumption that $M$ models describe the variation of the dynamics as follows:


\[
p(\mathbf{x}^{k} \mid Z) = \sum_{i=1}^{M} p_{\Theta_{IMM}}(\mathbf{x}^{k} \mid m_{i}, Z)\, P(m_{i} \mid Z). \tag{4.23}
\]


Here, $p_{\Theta_{IMM}}(\mathbf{x}^{k} \mid m_{i}, Z)$ is, in the context of an IMM
filter, a Gaussian distribution and $P(m_{i} \mid Z) \triangleq \alpha_{i}$ is the
posterior mode probability for the IMM filter. As explained, the transition between
different dynamics is modeled as a first-order Markov chain for an IMM filter. The
law of total probability allows computing new mode probabilities based on the
transition probabilities. Given the current mode probabilities and transition
probabilities, the weighting probabilities $\alpha_{i|j}$ for the mixing step of the
IMM filter can be calculated. For each pair of models $m_{i}$ and $m_{j}$, they are
calculated as $\alpha^{k-1}_{i|j} = \frac{1}{c_{j}}\, p_{ij}\, \alpha^{k-1}_{i}$ with a
normalization factor $c_{j} = \sum_{i=1}^{M} p_{ij}\, \alpha^{k-1}_{i}$ (see section
3.1.2). Then, in the prediction stage, each filter is applied independently using the
calculated mixed initial condition. Subsequently, the model probabilities are adapted
according to the likelihood of each filter.
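
The mixing step described above can be sketched in a few lines. The example assumes a two-model filter with an illustrative TPM and illustrative mode probabilities; the function name `imm_mixing` is hypothetical. Rows of the TPM index the previous mode, columns the current mode.

```python
import numpy as np

def imm_mixing(tpm, mode_probs):
    """Mixing probabilities of the IMM filter.

    tpm[i, j]     : transition probability from model i to model j
    mode_probs[i] : mode probability of model i at the previous step
    Returns (mix, c) with mix[i, j] = p_ij * alpha_i / c_j and the
    normalization factors c_j (the predicted mode probabilities).
    """
    joint = tpm * mode_probs[:, None]   # p_ij * alpha_i
    c = joint.sum(axis=0)               # c_j = sum_i p_ij * alpha_i
    return joint / c, c

tpm = np.array([[0.95, 0.05],
                [0.10, 0.90]])
alpha = np.array([0.7, 0.3])
mix, c = imm_mixing(tpm, alpha)
```

Each column of `mix` sums to one, so it can be used directly to weight the model-specific estimates when forming the mixed initial condition of filter $j$.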


The explicit modeling of the switching behavior and the object dynamics
in the IMM filter stands in contrast to the implicit dynamic encoding of an
RNN-based approach. In order to provide an IMM filter surrogate, the
proposed model also estimates mode probabilities and filters, or rather de-noises,
the current position based on noisy observations $Z$. By writing the
conditional distribution $p(Y, \hat{\mathbf{x}}^{k} \mid Z)$ of the RNN-based approach
in the form of equation 4.22, the desired estimates can be inferred from the hidden
states $h$ of the RNN. This formulation does not require setting the parameters of
the TPM manually, which is commonly done based on the mean sojourn time (the
mean time an object stays in a motion type [Sch13, Bar02]) or, as stated in the
work of Bar-Shalom [Bar02], by the ad-hoc approach of filling the diagonal with
values close to one. For the proposed RNN-based IMM filter surrogate, the
basic architecture is a recurrent encoder-decoder model. The encoder takes the
frame-by-frame input sequence $Z$. The hidden state vector of the encoder is
updated at each time step based on the previous hidden state and the current
observation. The generated internal representation is used to predict the mode
probabilities $\hat{\alpha}^{k}$ at the current time step and the filtered position
$\hat{\mathbf{x}}^{k}$.
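

With an embedding of the current observations, the encoder can be defined
as follows:


\[
\begin{aligned}
e^{k} &= \mathrm{EMB}(\mathbf{z}^{k}; \Theta_{e}), \\
h_{enc}^{k} &= \mathrm{RNN}(h_{enc}^{k-1}, e^{k}; \Theta_{enc}), \\
\alpha_{logits}^{k}, \hat{\mathbf{x}}^{k} &= \mathrm{MLP}(h_{enc}^{k}; \Theta_{MLP}), \\
\hat{\alpha}_{i}^{k} &= \frac{\exp(\alpha_{logits,i}^{k})}{\sum_{j=1}^{M} \exp(\alpha_{logits,j}^{k})}.
\end{aligned}
\tag{4.24}
\]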


Here, RNN(·) is the recurrent network, $h$ the hidden state of the RNN, MLP(·)
the multi-layer perceptron, and EMB(·) an embedding layer. $\Theta_{(\cdot)}$
represents the weights $W_{(\cdot)}$ and biases $b_{(\cdot)}$ of the MLP, the EMB,
and the RNN, respectively. The final state of the encoder can be expected to encode
information about the track history. For generating a distribution over trajectories
conditioned on dynamical modes, the encoder hidden state is appended with a one-hot
encoded vector corresponding to the specific dynamics and with the filtered current
position. The decoder of the model can be defined as follows:


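\[
\begin{aligned}
h_{dec}^{k-1} &= h_{enc}^{k}, \\
h_{dec}^{k} &= \mathrm{RNN}(h_{dec}^{k-1}, [\hat{\alpha}^{k}, \hat{\mathbf{x}}^{k}]; \Theta_{dec}), \\
\Theta_{RNN\text{-}IMM} &= \big\{(\hat{\mu}_{l}^{k} + \hat{\mathbf{x}}^{k_{obs}}, \hat{\Sigma}_{l}^{k}, \hat{w}_{l}^{k})_{l=1,\dots,L}\big\}_{k=k_{obs}+1}^{k_{pred}} = \mathrm{MLP}(h_{dec}^{k}; \Theta_{MLP}).
\end{aligned}
\tag{4.25}
\]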


The decoder is used to parametrize an MDN, or rather $\Theta_{RNN\text{-}IMM}$,
directly for several positions in the future. Although the overall RNN-IMM uses the
trajectory prediction and dynamical classification jointly, the loss function
for training is split into three parts. Dynamical classification is trained to
minimize the cross-entropy of the $M$ different dynamical modes:


\[
\mathcal{L}(Z)_{maneuver} = -\sum_{j=1}^{M} \alpha_{j}^{k} \log(\hat{\alpha}_{j}^{k}). \tag{4.26}
\]
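
The softmax normalization of the mode logits in the encoder and the cross-entropy loss for the maneuver classification can be sketched together as follows. The logits and the one-hot ground truth are made-up values for three hypothetical maneuver classes; subtracting the maximum logit and adding a small `eps` inside the logarithm are numerical safeguards chosen here, not details taken from the thesis.

```python
import numpy as np

def softmax(logits):
    """Normalize logits to mode probabilities."""
    shifted = logits - np.max(logits)   # avoids overflow, result unchanged
    e = np.exp(shifted)
    return e / e.sum()

def maneuver_loss(alpha_true, alpha_pred, eps=1e-12):
    """Cross-entropy between ground-truth and predicted mode probabilities."""
    return -np.sum(alpha_true * np.log(alpha_pred + eps))

logits = np.array([2.0, 0.5, -1.0])          # hypothetical encoder output
alpha_hat = softmax(logits)
alpha_gt = np.array([1.0, 0.0, 0.0])         # one-hot ground-truth mode
loss = maneuver_loss(alpha_gt, alpha_hat)
```

For a one-hot ground truth, the loss reduces to the negative logarithm of the probability assigned to the true mode.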


Additionally, the encoder is trained by minimizing the filtering loss
$\mathcal{L}(Z)_{filter}$ in form of the negative log-likelihood of the ground truth
current pedestrian position under the predicted position. Finally, the complete
encoder-decoder is trained by minimizing the negative log-likelihood for the ground
truth future pedestrian locations conditioned on the performed maneuver class. The
context vector is appended with the ground truth values of the dynamical
classes for each training trajectory. This results in the following loss
function:


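\[
\begin{aligned}
\mathcal{L}(Z)_{pred} &= -\log\big(p_{\Theta_{RNN\text{-}IMM}}(Y \mid m_{GT}, Z)\, P(m_{GT} \mid Z)\big) \\
&= -\log\Big(\sum_{k=k_{obs}+1}^{k_{pred}} \sum_{l=1}^{L} \hat{w}_{l}^{k}\, \mathcal{N}\big(\mathbf{z}^{k} \mid \hat{\mu}_{l}^{k} + \hat{\mathbf{x}}^{k_{obs}}, \hat{\Sigma}_{l}^{k}, m_{GT}\big)\Big).
\end{aligned}
\tag{4.27}
\]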
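
As a small illustration of the Gaussian mixture term that appears inside the prediction loss, the following sketch evaluates the negative log-likelihood of one ground truth position under an $L$-component mixture whose means are offsets relative to the filtered position. Only a single future position is treated; how the terms are aggregated over the prediction horizon is deliberately left out. The function name and the example values are assumptions.

```python
import numpy as np

def gmm_nll(z, weights, means, covs, x_filtered):
    """Negative log-likelihood of position z under a Gaussian mixture
    with component means mu_l + x_filtered."""
    log_terms = []
    for w, mu, cov in zip(weights, means, covs):
        diff = z - (mu + x_filtered)
        maha = diff @ np.linalg.solve(cov, diff)
        log_norm = -0.5 * (len(z) * np.log(2.0 * np.pi)
                           + np.log(np.linalg.det(cov)))
        log_terms.append(np.log(w) + log_norm - 0.5 * maha)
    log_terms = np.array(log_terms)
    top = log_terms.max()                     # log-sum-exp for stability
    return -(top + np.log(np.exp(log_terms - top).sum()))

# One component centered exactly on the ground truth position.
nll = gmm_nll(z=np.array([1.0, 0.5]),
              weights=[1.0],
              means=[np.zeros(2)],
              covs=[np.eye(2)],
              x_filtered=np.array([1.0, 0.5]))
```

In this degenerate case the value equals $\log(2\pi)$, the negative log-density of a standard two-dimensional Gaussian at its mean.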


The overall architecture is visualized in figure 4.16. The context vector com- 
bines the encoding of the track history with the encoding of the alternating 
dynamical classes. Together with the filtered position, it is used as input for 
the decoder. 


4.2.2.2 Evaluation 


This section consists of an evaluation of the proposed RNN-IMM. The evalua- 
tion is concerned with verifying the overall viability of the approach in ma- 
neuver situations in terms of switching motion behavior. Firstly, synthetic 
test conditions are used in order to gain insight into the model behavior in 
different typical pedestrian maneuvers. By doing that, factors such as a re- 
stricted amount of training samples are avoided. Later, the evaluation is also 
done on the Daimler context path prediction dataset [Koo14], which is a real- 
world dataset designed to capture pedestrian maneuvers from a driving
vehicle. The synthetic data reflects the Daimler context path prediction dataset and
the Daimler path prediction dataset [Sch13] by capturing similar conditions and
using the statistics in the data for generating samples. Both datasets contain
sequences recorded with the same sensor setup, or rather the same vehicle.
The 2014 dataset version is focused on crossing and stopping maneuvers.




In the domain of intelligent vehicles, intention prediction of pedestrians is com- 
monly done in an ego-motion compensated vehicle centered coordinate sys- 
tem. The detections provided by an object detector are mapped onto a ground 
plane in world coordinates. For the Daimler datasets, a stereo camera-system 
(baseline 22 cm, 16 frames per second (fps), 1176 x 640 pixels) is used, in- 
ter alia, for mapping the observation to the physical world. Thereby, the in- 
corporation of prior knowledge about the dynamics of pedestrian motion is 
enabled. As explained in section 2.3.2, there exist several physics-based dy- 
namical models that are applied in combination with Bayesian filters under 
these conditions. The choice of physics-based reference approaches
is based on a comparative study by Schneider et al. [Sch13] on
recursive Bayesian filters for pedestrian path prediction at short time horizons
(below 2 seconds) on the Daimler path prediction dataset. For single models
in combination with a Kalman filter, a CV and a CA model are considered
for crossing scenarios. These dynamical models are also used for the
prediction of pedestrian positions by Bertozzi et al. [Ber04], Meuter et al. [Meu08],
Møgelmose et al. [Mog15], Binelli et al. [Bin05], and Elnagar et al. [Eln01], to
name a few. Further, the proposed IMM filters of Schneider et al. are the core
of the introduced extensions for including several contextual cues to control
the transition probability between single dynamical models (see section 2.3.2
and [Koo19, Koo14, Sch15]).


The Daimler path prediction dataset provides at most 23 sequences for a single
motion type. In order to avoid problems such as a limited number of training
samples and to gain insights in a controlled setup, synthetic data is used to
perform a sanity check analysis. Furthermore, there is a location bias in the
Daimler datasets, which is more pronounced for the 2013 version and the bending-in
maneuvers. Since recursive Bayesian filters make no use of the spatial context of
a scene in their standard formulation,
this does not harm their mutual comparison. In section 2.3.1, we discussed
the ability of RNN-based prediction networks to capture spatially dependent 
behavior changes. In order to make a fairer comparison for the real-world 
data evaluation, all tracklets are normalized to start at the origin. 
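
The normalization mentioned above is a plain translation of each tracklet. A minimal sketch with a made-up tracklet (the function name is hypothetical) also shows that the offsets used for conditioning are unaffected by it, so only the absolute position information changes:

```python
import numpy as np

def normalize_tracklet(positions):
    """Translate a tracklet so that its first position is the origin."""
    positions = np.asarray(positions, dtype=float)
    return positions - positions[0]

tracklet = np.array([[3.0, 12.0], [3.1, 11.5], [3.2, 11.0]])
shifted = normalize_tracklet(tracklet)
offsets = np.diff(shifted, axis=0)   # identical to np.diff(tracklet, axis=0)
```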


The evaluation on the Daimler 2014 dataset is done in an ego-motion
compensated reference system. The frame rate of the camera system inside the
recording vehicle is 16 fps, and it is adopted accordingly for our experiments.
Pedestrians can change their behavior abruptly; therefore, sensible time horizons
are short. Here, 8 consecutive positions (0.5 seconds) are observed before
predicting the next 8 (0.5 seconds), 12 (0.75 seconds), and 16 (1 second) positions.


For generating synthetic trajectories of a basic maneuvering pedestrian,
random agents are sampled with a speed drawn from a Gaussian distribution
according to a preferred pedestrian walking speed [Tek02]
($\mathcal{N}(1.38\,\mathrm{m/s}, (0.37\,\mathrm{m/s})^{2})$) and with a starting
position drawn from the distribution of starting positions of the corresponding
Daimler dataset sequences. The distribution of starting positions is approximated
with a mixture of Gaussians with five components, fitted with the
expectation-maximization (EM) algorithm [Bis06]. The chosen preferred walking
speed corresponds approximately to the
TrajNet dataset analysis in section 4.2.1.
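
A strongly simplified version of such a generator, restricted to the lateral direction, could look as follows. The sketch is not the generator used for the experiments: it starts every agent at the origin instead of sampling from the fitted mixture of Gaussians, uses a constant deceleration over a fixed stopping duration, and adds Gaussian observation noise; all parameter names and default values are assumptions chosen for illustration.

```python
import numpy as np

def simulate_lateral_track(rng, n_steps=24, fps=16.0, stop_at=None,
                           stop_duration=1.0, obs_std=0.01):
    """Simulate one lateral trajectory, optionally with a stopping maneuver.

    Returns (truth, observed): noise-free and noisy positions in meters.
    """
    dt = 1.0 / fps
    speed = abs(rng.normal(1.38, 0.37))   # preferred walking speed in m/s
    decel = speed / stop_duration         # constant deceleration while stopping
    x, v, truth = 0.0, speed, []
    for k in range(n_steps):
        if stop_at is not None and k >= stop_at:
            v = max(0.0, v - decel * dt)  # slow down until standing
        x += v * dt
        truth.append(x)
    truth = np.array(truth)
    observed = truth + rng.normal(0.0, obs_std, size=n_steps)
    return truth, observed

rng = np.random.default_rng(0)
truth, observed = simulate_lateral_track(rng, stop_at=8)
```

With `stop_at=8` and 24 steps at 16 fps, the agent walks for 0.5 seconds, decelerates for one second and is standing at the end of the sequence.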


The RNN-IMM models have been implemented using TensorFlow [Aba15] and
are trained for 2000 epochs using the ADAM optimizer [Kin15] with a decreasing
learning rate, starting from 0.01 with a learning rate decay of 0.95 and a decay
factor of 1/10. For the experiments, the RNN variant LSTM [Hoc97] is used.


4.2.2.3 Scenario: Crossing 


In the first scenario, the pedestrian intending either to cross or not to cross 
the street laterally is considered. During a single trajectory simulation, the 
agents head laterally and can perform a stopping maneuver or cross the street. 
Figure 4.17 illustrates such maneuvers with example images from the Daimler 
dataset [Sch13]. For mapping the pedestrian detections to a vehicle-motion 
compensated ground plane, Schneider et al. used onboard sensors for velocity 
and yaw rate, and a stereo camera system to compute the median disparity 
based on a dense stereo approach (semi-global matching) [Hir08]. 


Due to the non-linear observation model, which is based on a perspective camera
model, a linearized extension of the Kalman and IMM filter
observation models is required.


Figure 4.17: Illustration of typical pedestrian motions. The images depict the two chosen
maneuver classes of straight walking (crossing) and stopping. The images on the
left show a person crossing the street. The images on the right show a person
transitioning from walking to standing at the curbside of the street, in particular
changing from straight walking to stopping [Sch13].


Here, the observation uncertainty is assumed to be Gaussian distributed,
$w^{k} \sim \mathcal{N}(0, (0.01\,\mathrm{m})^{2})$, in the compensated reference
system. Thus, the standard formulation of the Bayesian filters is well suited for
this task. For the stopping maneuver, or rather the event of deceleration until
standing, a mean sojourn time of 1 second with a standard deviation of 0.1 seconds
is used. As a reminder, the mean sojourn time $\tau_{so}$ is the mean time an
object stays in a motion type or dynamical mode [Sch13, Bar02]. Blackman [Bla99]
suggests using this time to specify the TPM instead of using the ad-hoc approach of
filling the diagonal with values close to one. Thus, the probability of remaining in
a model is set to $p_{ii} = 1 - \Delta T / \tau_{so}$ (see equation 3.37). The
sojourn times are derived
from the Daimler datasets.
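
As a worked example of this rule, a sampling period of 1/16 s and a mean sojourn time of 1 s give a diagonal entry of 1 - 0.0625 = 0.9375. The sketch below builds a full TPM from per-model sojourn times. How the remaining probability mass is distributed over the other models is not specified by the rule itself; spreading it uniformly is an assumption made here, and the function name is hypothetical.

```python
import numpy as np

def tpm_from_sojourn(sojourn_times, dt):
    """Build a TPM with diagonal entries 1 - dt / tau_i.

    Assumption: the off-diagonal mass of each row is split uniformly
    over the other models (requires at least two models).
    """
    tau = np.asarray(sojourn_times, dtype=float)
    m = len(tau)
    stay = 1.0 - dt / tau
    tpm = np.zeros((m, m))
    for i in range(m):
        tpm[i, :] = (1.0 - stay[i]) / (m - 1)
        tpm[i, i] = stay[i]
    return tpm

tpm = tpm_from_sojourn([1.0, 1.0], dt=1.0 / 16.0)
```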


As long as a person moves in a straight line at a nearly constant speed, their
dynamics can be captured with a Kalman filter using a CV model. During the
maneuver, the description with one fixed motion model
fails due to the additional deceleration. Similar to Schneider et al. [Sch13] or
Kooij et al. [Koo14], one reference IMM filter is set up by combining two basic
models, in particular the CV and the CA model. As discussed in section 3.2.1,
side effects due to independent motions in different directions are avoided by
only considering the crossing direction or, from the vehicle perspective, the
lateral motion. In other existing work, a combination of a CP model and a CV
model is suggested because of the relatively short decelerating phase (see for
example [Kel11]). Furthermore, an IMM filter with the three dynamic models
CP, CV, and CA is used, for example, in the work of Goldhammer
et al. [Gol14]. Thereby, a transition from straight walking to stopping
is modeled in a physically plausible manner with a deceleration phase.
Following the aforementioned explanations, the RNN-IMM is compared to the
described IMM filters with two dynamical models ((CV, CA); (CP, CV)), an
IMM filter with three dynamical models (CP, CV, CA), a Kalman filter with a
single CV model, a Kalman filter with a single CA model, and, as a baseline, a
linear interpolation.


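Also in correspondence with Schneider et al., the process noise $v^{k}$ is determined
by $Q^{k} = Q_{0}^{k}\, q$, where $q \in \{\sigma_{CP}^{2}, \sigma_{CV}^{2}, \sigma_{CA}^{2}\}$
describes the changes in position, in velocity, or in acceleration, respectively,
over a sampling period $\Delta T$. The covariance matrix of a CV (white noise
acceleration) model [Bar02], with $Q_{0}$ as the bracketed matrix, is given
by


\[
Q_{CV}^{k} = \begin{bmatrix}
0.25\,\Delta T^{4} & 0.5\,\Delta T^{3} \\
0.5\,\Delta T^{3} & \Delta T^{2}
\end{bmatrix} \sigma_{CV}^{2}. \tag{4.28}
\]


The physical dimension of $\sigma_{CV}$ is that of an acceleration. For the CA (Wiener
process acceleration) model [Bar02], it is given by


\[
Q_{CA}^{k} = \begin{bmatrix}
0.25\,\Delta T^{4} & 0.5\,\Delta T^{3} & 0.5\,\Delta T^{2} \\
0.5\,\Delta T^{3} & \Delta T^{2} & \Delta T \\
0.5\,\Delta T^{2} & \Delta T & 1
\end{bmatrix} \sigma_{CA}^{2}. \tag{4.29}
\]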


In this model, the process noise $v^{k}$ is the acceleration increment during the
$k$th sampling period. Based on the above process noise model, Schneider
et al. [Sch13] estimated the process noise parameters for the different
chosen filters (IMM filter (CV, CA), Kalman filter CV, Kalman filter CA) on the
Daimler dataset. For the IMM filter (CV, CA), these noise parameters are
$\sigma_{IMM,CV} = 0.7\,\mathrm{m/s^{2}}$ and $\sigma_{IMM,CA} = 0.8\,\mathrm{m/s^{3}}$,
and for the single Kalman filters $\sigma_{CV} = 0.77\,\mathrm{m/s^{2}}$
and $\sigma_{CA} = 0.44\,\mathrm{m/s^{3}}$. For the CP model, the noise parameter can
be set based on the steady walking pace. For the IMM filter (CP, CV), we used
settings similar to Keller et al. [Kel11] ($\sigma_{IMM,CP} = 0.1\,\mathrm{m/s}$,
$\sigma_{IMM,CV} = 0.09\,\mathrm{m/s^{2}}$). The noise parameters of the three-model
IMM filter are $\sigma_{IMM,CP} = 0.1\,\mathrm{m/s}$,
$\sigma_{IMM,CV} = 0.7\,\mathrm{m/s^{2}}$, and $\sigma_{IMM,CA} = 0.8\,\mathrm{m/s^{3}}$.
These parameters are
consistent with the suggested practical setting in Bar-Shalom [Bar02] and the
chosen sojourn time for the simulation.
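
The two process noise matrices are outer products of the respective noise gain vectors, which the following sketch makes explicit. It also runs the prediction step of a one-dimensional CV model over 8 steps at 16 fps with the single Kalman filter setting of 0.77 m/s² quoted above. The initial state, the initial covariance and the function names are assumptions for illustration.

```python
import numpy as np

def q_cv(dt, sigma):
    """Process noise of the CV (white noise acceleration) model."""
    g = np.array([0.5 * dt**2, dt])
    return sigma**2 * np.outer(g, g)

def q_ca(dt, sigma):
    """Process noise of the CA (Wiener process acceleration) model."""
    g = np.array([0.5 * dt**2, dt, 1.0])
    return sigma**2 * np.outer(g, g)

def predict_cv(x, P, dt, sigma, steps):
    """Repeated Kalman prediction step for the state [position, velocity]."""
    F = np.array([[1.0, dt], [0.0, 1.0]])
    Q = q_cv(dt, sigma)
    for _ in range(steps):
        x = F @ x
        P = F @ P @ F.T + Q
    return x, P

dt = 1.0 / 16.0
x0 = np.array([0.0, 1.38])        # start at the origin with 1.38 m/s
P0 = 0.01 * np.eye(2)             # hypothetical initial covariance
x_pred, P_pred = predict_cv(x0, P0, dt, sigma=0.77, steps=8)
```

After 0.5 seconds, the predicted position is 0.69 m and the covariance has grown accordingly, which is the behavior a CV filter shows regardless of whether the pedestrian has started to decelerate.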


Figure 4.18: Visualization of the predicted multi-modal distributions of future positions as
a heatmap. (Left) Density plots for crossing, or rather straight walking, examples.
(Right) Density plots for stopping examples, in which the maximum of the predicted
distribution is visible close to the last observation.


Compared to vehicle maneuvers, such as lane changing, a definition of
maneuver classes for pedestrians is harder to establish. Since in most cases the
standard behavior of pedestrians is straight walking, the deviation from this
standard behavior is detected here, i.e., whether or not the pedestrian is in the
normal mode. A fixed deviation in velocity (a deceleration) along the tangent of
the ground truth trajectory is used to assign a maneuver label to each time step
of a single trajectory. Thus, the RNN-IMM and the IMM filters have a similar
dynamical model set description. As the distribution over the trajectories for
the RNN-IMM is captured with a Gaussian mixture model, the maneuver
description for a single model can still be multi-modal. Since the IMM filter
predicts a multi-modal distribution in the form of a combination of the
uni-modal model-specific predictions, the RNN-IMM is likewise set to predict a
uni-modal Gaussian distribution conditioned on a single maneuver class in
the presented results.


Table 4.5: Results for the comparison between an RNN-IMM, IMM filters with several
dynamical model setups, Kalman filters with single models, and linear interpolation
on the simulated maneuver situations of crossing and stopping. The prediction is
done for 8, 12, and 16 time steps conditioned on 8 observations at a frame rate of
16 fps.


                           8/8                8/12               8/16
Approach                   FDE/m   σ_FDE/m    FDE/m   σ_FDE/m    FDE/m   σ_FDE/m
RNN-IMM                    0.0309  0.0404     0.0427  0.0817     0.0517  0.0941
IMM filter (CP, CV, CA)    0.0612  0.0606     0.113   0.130      0.1736  0.1901
IMM filter (CV, CA)        0.0674  0.0602     0.1188  0.1255     0.1862  0.1915
IMM filter (CP, CV)        0.1073  0.0916     0.2031  0.1623     0.3101  0.214
Kalman filter (CA)         0.0796  0.0638     0.1575  0.1137     0.2386  0.1696
Kalman filter (CV)         0.1578  0.1601     0.2890  0.2965     0.4701  0.4700
Linear interpolation       0.1587  0.1610     0.2903  0.2978     0.4724  0.4718


In figure 4.18, predictions for two differently performed motion types are
depicted for 8 future positions, weighted by the predicted maneuver probability.
In the shown images, the positions are normalized to start at the origin. The
resulting multi-modal prediction is visualized as a heatmap. On the left, it
can be seen that for a crossing sequence with straight walking, the RNN-IMM
mainly uses the corresponding straight walking model. On the right, where
the deceleration has started, the straight walking probability is visibly lower, and
the predicted distribution maximum is very close to the last observation. For
the quantitative evaluation, 1000 noisy trajectories have been synthetically
generated, of which 80% are used for training and 20% for the comparison to the
recursive Bayesian filters. The results are summarized in table 4.5.


The performance is compared using the final displacement error (FDE) (see 
for example [Pel09]) of the lateral motion (from the vehicle perspective) for 
three different time horizons, in particular 8 steps (0.5 seconds), 12 steps 
(0.75 seconds), and 16 steps (1 second). These results show that the presented
RNN-IMM is able to capture the changing dynamics for the synthetically
generated data faster than the reference filters. In terms of the single motion
models (CV versus CA), one can observe the benefits of the CA model in capturing
the deceleration. Since pedestrian acceleration or deceleration phases are
relatively short due to pedestrians quickly reaching their preferred walking speed
or a stopping state [Gol17], the CA and higher-order dynamical models can lead to high
prediction errors for longer time horizons. In order to capture the maneuver,
the switch to other models is beneficial. For the crossing and stopping simu- 
lation, the IMM filters show an overall improvement over the single models. 
The best result is achieved with the three model IMM filter. The RNN-based 
IMM filter surrogate is able to capture the switch to a stopping mode. The 
engineering task of finding the best model set for IMM filters and their ex- 
tensions can lead to improved behavior (see for example Keller et al. [Kel14]) 
in specific maneuver situations, but it is also a very tedious process to find 
a good setting. As discussed, recent work such as the approach of Kooij et al.
[Koo19] shows options for further improving the prediction performance
of IMM filters by using a DBN on top and thereby including scene context
and more cues than pedestrian point kinematics (e.g., head orientation, gaze,
body tilt, articulated body information). The integrated cues and the corresponding
additional latent states modify only the transition probability between single
dynamical models, which is not required for the RNN-IMM.


However, the presented RNN-IMM is also able to provide a confidence value
$P(m_{i} \mid Z) \triangleq \alpha_{i}$ for the performed dynamics, but without an
explicit modeling of the dynamics transitions in the form of a fixed TPM. Similar to
the mode probabilities provided by IMM filters, this can be used for subsequent
processing stages (see for example our works [Bec15, Mün16a, Mün16b]). Further,
instead of choosing the basic filter set, the prediction model is learned. In case
there exists some well-known model for describing the standard dynamics of
the desired object, only deviations from the known dynamics can be used to
define additional maneuver classes.


4.2.2.4 Scenario: Turning 


Figure 4.19: Illustration of a typical pedestrian maneuver. The above images depict a change 
from straight walking to turning and thus a sudden crossing of the street (bending 
in). 


Another prototypical pedestrian maneuver that is of keen interest in the
context of intelligent vehicles is a turning maneuver. For such a maneuver,
the pedestrian's dynamics change from straight walking to a bending in
behavior. Similar to the simulation of the crossing/stopping scenario, a basic
maneuvering pedestrian is simulated with random agents: the walking speed is
sampled from a Gaussian distribution in accordance with common pedestrian
walking speeds [Tek02] ($\mathcal{N}(1.38\,\mathrm{m/s}, (0.37\,\mathrm{m/s})^2)$),
and the starting position is sampled from the distribution of starting positions
of annotated bending in sequences from the Daimler path prediction dataset.
Hence, the same fixed frame rate of 16 fps is used. During a single trajectory
simulation, the agents can perform a turning maneuver. The change in heading is
sampled from a uniform distribution between 45° and 100°. The duration of the
turning event is sampled from a Gaussian distribution based on the mean sojourn
time estimated from the ground truth sequences
($\mathcal{N}(1.83\,\mathrm{s}, (0.29\,\mathrm{s})^2)$). For comparison with
common physics-based dynamical models, the simulation is done on ground level.
In this scenario, the longitudinal motion is also crucial to capture such a
maneuver. In addition to the filters chosen above, Keller et al. suggest a
combination of a CV model and a CT model as elemental filters to model the
switching behavior. The corresponding joint state vector of the CT model with
the turning rate $\omega$ can be expressed as


$$x_{\mathrm{CT}} = [x, \dot{x}, y, \dot{y}, \omega]^{\top}. \qquad (4.30)$$


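Since the model is non-linear, the estimation of the state is done via an EKF
[Bar02]. The corresponding dynamical model for state prediction is given by

$$
x^{k+1}_{\mathrm{CT}} =
\underbrace{\begin{bmatrix}
1 & \frac{\sin(\omega^k \Delta T)}{\omega^k} & 0 & \frac{\cos(\omega^k \Delta T) - 1}{\omega^k} & 0 \\
0 & \cos(\omega^k \Delta T) & 0 & -\sin(\omega^k \Delta T) & 0 \\
0 & \frac{1 - \cos(\omega^k \Delta T)}{\omega^k} & 1 & \frac{\sin(\omega^k \Delta T)}{\omega^k} & 0 \\
0 & \sin(\omega^k \Delta T) & 0 & \cos(\omega^k \Delta T) & 0 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}}_{F^k_{\mathrm{CT}}}
x^k_{\mathrm{CT}} + G^k_{\mathrm{CT}} v^k_{\mathrm{CT}}.
\qquad (4.31)
$$

Here, the rows and columns of $F^k_{\mathrm{CT}}$ follow the state ordering
$[x, \dot{x}, y, \dot{y}, \omega]^{\top}$.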
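
The prediction step of the CT model can be illustrated with a few lines of Python. This is a minimal sketch and not the implementation used in this work: it assumes the state ordering [x, vx, y, vy, omega], omits the process noise term, and the helper names `ct_transition` and `ct_predict` as well as the fallback to a CV step for a vanishing turning rate are illustrative assumptions.

```python
import math


def ct_transition(omega, dt):
    """Transition matrix of the CT model for the state [x, vx, y, vy, omega]."""
    if abs(omega) < 1e-9:
        # Assumption: for a vanishing turning rate, use the CV limit.
        a, b, s, c = dt, 0.0, 0.0, 1.0
    else:
        s, c = math.sin(omega * dt), math.cos(omega * dt)
        a, b = s / omega, (1.0 - c) / omega
    return [
        [1.0, a, 0.0, -b, 0.0],
        [0.0, c, 0.0, -s, 0.0],
        [0.0, b, 1.0, a, 0.0],
        [0.0, s, 0.0, c, 0.0],
        [0.0, 0.0, 0.0, 0.0, 1.0],
    ]


def ct_predict(state, dt):
    """Noise-free one-step prediction x_next = F(omega) x."""
    F = ct_transition(state[4], dt)
    return [sum(F[i][j] * state[j] for j in range(5)) for i in range(5)]


# Example: an agent walking at 1.38 m/s that turns with 90 deg/s at 16 fps.
state = [0.0, 1.38, 0.0, 0.0, math.pi / 2]
for _ in range(16):  # one second of motion
    state = ct_predict(state, 1.0 / 16.0)
speed = math.hypot(state[1], state[3])  # the CT model preserves the speed
```

After one second the velocity vector has rotated by 90° while its magnitude is unchanged, which is the defining property of the coordinated turn.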
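
Despite the matrix form, the model can be equivalently expressed as a set of
equations $f_{\mathrm{CT}}$. To use an EKF for estimation, the Jacobian
$J^k_{\mathrm{CT}} = \left.\frac{\partial f_{\mathrm{CT}}}{\partial x_{\mathrm{CT}}}\right|_{\hat{x}^k_{\mathrm{CT}}}$
of the dynamical equations must be computed in order to propagate the state
covariance. For the state ordering $[x, \dot{x}, y, \dot{y}, \omega]^{\top}$, it
reads

$$
J^k_{\mathrm{CT}} =
\begin{bmatrix}
1 & \frac{\sin(\omega^k \Delta T)}{\omega^k} & 0 & \frac{\cos(\omega^k \Delta T) - 1}{\omega^k} & \frac{\partial x}{\partial \omega} \\
0 & \cos(\omega^k \Delta T) & 0 & -\sin(\omega^k \Delta T) & \frac{\partial \dot{x}}{\partial \omega} \\
0 & \frac{1 - \cos(\omega^k \Delta T)}{\omega^k} & 1 & \frac{\sin(\omega^k \Delta T)}{\omega^k} & \frac{\partial y}{\partial \omega} \\
0 & \sin(\omega^k \Delta T) & 0 & \cos(\omega^k \Delta T) & \frac{\partial \dot{y}}{\partial \omega} \\
0 & 0 & 0 & 0 & 1
\end{bmatrix},
\qquad (4.32)
$$

where all entries are evaluated at $\hat{x}^k_{\mathrm{CT}}$, with the partial
derivatives with respect to the turning rate

$$
\begin{aligned}
\frac{\partial x}{\partial \omega} &= \frac{\omega^k \Delta T \cos(\omega^k \Delta T) - \sin(\omega^k \Delta T)}{(\omega^k)^2}\, \dot{x}^k
 - \frac{\omega^k \Delta T \sin(\omega^k \Delta T) + \cos(\omega^k \Delta T) - 1}{(\omega^k)^2}\, \dot{y}^k, \\
\frac{\partial \dot{x}}{\partial \omega} &= -\Delta T \sin(\omega^k \Delta T)\, \dot{x}^k - \Delta T \cos(\omega^k \Delta T)\, \dot{y}^k, \\
\frac{\partial y}{\partial \omega} &= \frac{\omega^k \Delta T \sin(\omega^k \Delta T) + \cos(\omega^k \Delta T) - 1}{(\omega^k)^2}\, \dot{x}^k
 + \frac{\omega^k \Delta T \cos(\omega^k \Delta T) - \sin(\omega^k \Delta T)}{(\omega^k)^2}\, \dot{y}^k, \\
\frac{\partial \dot{y}}{\partial \omega} &= \Delta T \cos(\omega^k \Delta T)\, \dot{x}^k - \Delta T \sin(\omega^k \Delta T)\, \dot{y}^k.
\end{aligned}
\qquad (4.33)
$$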
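
The partial derivatives with respect to the turning rate are easy to get wrong, so a numerical check is useful. The following sketch compares the analytic last Jacobian column with a central finite difference of the noise-free CT dynamics. It assumes the state ordering [x, vx, y, vy, omega] and a non-zero turning rate; the function names and the test state are illustrative assumptions.

```python
import math


def f_ct(state, dt):
    """Noise-free CT dynamics for the state [x, vx, y, vy, omega]."""
    x, vx, y, vy, w = state
    s, c = math.sin(w * dt), math.cos(w * dt)
    return [
        x + s / w * vx - (1.0 - c) / w * vy,
        c * vx - s * vy,
        y + (1.0 - c) / w * vx + s / w * vy,
        s * vx + c * vy,
        w,
    ]


def omega_column(state, dt):
    """Analytic last Jacobian column: derivatives with respect to omega."""
    _, vx, _, vy, w = state
    s, c = math.sin(w * dt), math.cos(w * dt)
    return [
        (w * dt * c - s) / w**2 * vx - (w * dt * s + c - 1.0) / w**2 * vy,
        -dt * s * vx - dt * c * vy,
        (w * dt * s + c - 1.0) / w**2 * vx + (w * dt * c - s) / w**2 * vy,
        dt * c * vx - dt * s * vy,
        1.0,
    ]


def omega_column_numeric(state, dt, eps=1e-6):
    """Central finite difference of f_ct with respect to omega."""
    plus, minus = list(state), list(state)
    plus[4] += eps
    minus[4] -= eps
    f_plus, f_minus = f_ct(plus, dt), f_ct(minus, dt)
    return [(p - m) / (2.0 * eps) for p, m in zip(f_plus, f_minus)]


# Arbitrary test state (illustrative values only) at a frame rate of 16 fps.
state = [0.0, 1.2, 0.0, 0.5, 0.4]
analytic = omega_column(state, 1.0 / 16.0)
numeric = omega_column_numeric(state, 1.0 / 16.0)
max_abs_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```

If the analytic expressions are consistent with the dynamics, `max_abs_diff` stays at the level of the finite-difference error.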


For the noise process model used by Schneider et al., the process noise
covariance matrix has the form

$$
Q^k_{\mathrm{CT}} =
\begin{bmatrix}
0.25\,\Delta T^4 & 0.5\,\Delta T^3 & 0 & 0 & 0 \\
0.5\,\Delta T^3 & \Delta T^2 & 0 & 0 & 0 \\
0 & 0 & 0.25\,\Delta T^4 & 0.5\,\Delta T^3 & 0 \\
0 & 0 & 0.5\,\Delta T^3 & \Delta T^2 & 0 \\
0 & 0 & 0 & 0 & \frac{\sigma_\omega^2 \Delta T^2}{\sigma_a^2}
\end{bmatrix} \sigma_a^2.
\qquad (4.34)
$$

Here, $\sigma_\omega^2$ is the turning rate variance and $\sigma_a$ corresponds to
a random acceleration (see equation 4.28). For the CT model as part of the IMM
(CV, CT) filter, the noise parameters are chosen according to Schneider et al.
[Sch13] ($\sigma_{\mathrm{IMM,CV}} = 0.4$ m/s, $\sigma_{\mathrm{IMM,CT}} = 0.9$ rad/s²).
The noise parameters for the other dynamical models are chosen as in section
4.2.2.3. As discussed in section 3.2.1, using the proposed de-coupled IMM filter
can be beneficial due to independent motion along a particular direction. Hence,
a de-coupled IMM filter with a similar parameter setup is additionally used for
comparison.


Figure 4.20: Visualization of the predicted multi-modal distributions of future position as
heatmap for the bending in scenario. Two density plots showing different points
in time during the turning maneuver.


In the case of turning, the motion changes from rectilinear to curvilinear
dynamics. In relation to the dynamics, this results in an additional
acceleration or rather a change in heading. Therefore, a change from a constant
velocity model to a turning model or acceleration model indicates a critical
situation from the vehicle perspective. Figure 4.19 illustrates such a turning
or bending in maneuver.


Since the standard behavior of pedestrians is straight walking, a fixed
deviation in heading over a required time-horizon is used to assign maneuver
labels to single trajectories. The RNN-IMM outputs a multi-modal path
distribution based on the one-hot encoded maneuver classes. Here, the
conditioning is done for the three maneuver classes of turning left, turning
right, and straight walking. For the synthetic bending in scenario, the maneuver
distributions could be captured with one Gaussian component, even though the
trajectory distribution can still be made multi-modal by using an MDN with more
components. In figure 4.20, example multi-modal predictions for the bending in
maneuver scenario are visualized. The predicted density is visualized as a
heatmap with normalized positions starting at the origin. The two example images
show two turning situations at different points in time. On the left, the
straight walking probability is still dominant. On the right, the RNN-IMM
captures the maneuver and predicts a clockwise change in heading.


Table 4.6: Results for the comparison between an RNN-IMM and IMM filters with different
dynamical model structures, a Kalman filter with a single CV model, a Kalman filter
with a single CA model, and linear interpolation on the simulated bending in
maneuver situations. The prediction is done for 8, 12, and 16 time steps conditioned
on 8 observations for a frame rate of 16 fps.


                                         8/8                 8/12                8/16
Approach                           FDE/m   σ_FDE/m     FDE/m   σ_FDE/m     FDE/m   σ_FDE/m
RNN-IMM                            0.1009  0.1066      0.1833  0.2073      0.2479  0.3387
IMM filter (CV, CA)                0.1641  0.1068      0.3274  0.2131      0.5109  0.3732
IMM filter (CV, CA; de-coupled)    0.1626  0.1248      0.3119  0.2519      0.5055  0.4138
IMM EKF (CT, CV)                   0.1988  0.1575      0.3482  0.2745      0.5102  0.4227
Kalman filter (CA)                 0.1809  0.1145      0.3664  0.2497      0.5344  0.3257
Kalman filter (CV)                 0.2098  0.1707      0.3729  0.3007      0.5301  0.4523
Linear interpolation               0.2530  0.2084      0.4326  0.3543      0.6122  0.5100


For training and evaluation, 1000 noisy trajectories have been synthetically 
generated with a split of using 80% for training and 20% for evaluation. The 
results are summarized in table 4.6. For comparison, the final displacement 
error (FDE) is calculated as average L2 distance between the predicted final 
positions and the ground truth positions for three different time-horizons. 
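
The error measure can be written down compactly. The sketch below computes the per-trajectory final displacement errors and their mean and standard deviation. It is an illustration only: the function names are hypothetical, the toy positions are made up, and reading the second table column as the (population) standard deviation of the FDE is an assumption.

```python
import math


def final_displacement_errors(predicted_final, true_final):
    """L2 distance between predicted and true final positions, per trajectory."""
    return [math.dist(p, t) for p, t in zip(predicted_final, true_final)]


def mean_and_std(values):
    """Mean and population standard deviation of a list of errors."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


# Toy example with made-up final positions in metres (not thesis data).
predicted = [(1.0, 0.0), (3.0, 4.0)]
truth = [(1.0, 0.0), (0.0, 0.0)]
errors = final_displacement_errors(predicted, truth)
fde, sigma_fde = mean_and_std(errors)
```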


As before, the proposed RNN-IMM achieves the best result for the simulated
scenario and is able to capture the switch to another motion type. The IMM
filter solutions generally perform better than the single-model Kalman filters.
Comparing the different model set structures of the IMM filters, our de-coupled
IMM filter yields slightly better results, but the difference is not
significant. For the longer time-horizon, the effect of curvilinear motion is
more pronounced; thus the benefit of the RNN-IMM is more visible, and the
inclusion of the CT model in the filter setup has a more positive effect. The CT
model and its variants are more commonly used for tracking other road users,
such as bikes and vehicles (see for example [Koo19, Bat09]), or for capturing
the turning maneuvers of tracked aircraft [Li03]. The sheer number of existing
physics-based models and model set combinations clearly shows why the trend is
to shift to a pattern-based alternative. In the bending in scenario, too, the
RNN-based IMM filter surrogate is able to capture the switch in modes without
the engineering task of finding the best model set for the IMM filter.


4.2.2.5 Scenario: Real-World Data 


Table 4.7: Results for the comparison between an RNN-IMM and several filters, including
different IMM filters and single Kalman filters, on the Daimler context path prediction
dataset [Koo14]. The prediction is done for 8, 12, and 16 time steps conditioned on
8 observations.


                                 8/8                 8/12                8/16
Approach                   FDE/m   σ_FDE/m     FDE/m   σ_FDE/m     FDE/m   σ_FDE/m
RNN-IMM                    0.0811  0.1165      0.1244  0.2076      0.2137  0.3313
IMM filter (CP, CV)        0.1609  0.1495      0.2746  0.2518      0.3881  0.3711
IMM filter (CP, CV, CA)    0.1721  0.1688      0.3060  0.2753      0.4202  0.3925
IMM filter (CV, CA)        0.1792  0.1641      0.3242  0.2790      0.4602  0.4178
Kalman filter (CA)         0.2061  0.2240      0.5260  0.4055      0.8112  0.6300
Kalman filter (CV)         0.1618  0.1507      0.2749  0.2519      0.3885  0.3711
Linear interpolation       0.1628  0.1511      0.2773  0.2541      0.3918  0.3745


For the real-world data scenario, the evaluation is done on the Daimler context
path prediction dataset [Koo14], consisting of 58 sequences recorded with
the described sensor setup in an ego-motion compensated reference system.
All sequences involve individual pedestrians intending to cross the street or
to stop at the curbside (crossing and stopping maneuvers). The sequences are
further labeled with time-to-event (TTE) information (in frames) to focus
on the critical situations. For stopping pedestrians, the frame in which the
last foot of the pedestrian is placed on the ground at the curbside is labeled
with TTE = 0; for crossing pedestrians, it is the closest point before stepping
onto the road. Only the lateral motion and the frames between TTE = −15 and
TTE = 15 are considered. 5-fold cross-validation is done for tracklets capturing
time windows of 16, 20, and 24 consecutive positions, where 8 positions are
used for initialization. This matches the setting of the simulated crossing
sequences. In order to reduce side effects due to limited training samples,
the data is augmented with additional sequences to extend the total number of
training sequences to 200. According to the estimated lateral observation error,
the observation noise is set to $w^k \sim \mathcal{N}(0, (0.06\,\mathrm{m})^2)$ in
the vehicle-motion compensated reference system. The evaluation of the real data
is done on 20% of the real-world tracklets. Table 4.7 shows the final
displacement errors for the lateral positions for three time-horizons. In
tables 4.8 and 4.9, the results are split into the crossing and stopping
sequences.
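
The tracklet extraction described above can be sketched as follows. This is an illustration under stated assumptions, not the preprocessing used for the reported results: the function name `extract_tracklets`, the stride of one frame, and the toy sequence are assumptions; only the TTE range, the window lengths, and the 8 initialization positions are taken from the text.

```python
def extract_tracklets(positions, ttes, length, n_obs=8, tte_min=-15, tte_max=15):
    """Split one sequence into (observed, future) tracklets of a given length."""
    # Keep only the frames inside the considered TTE range.
    kept = [p for p, tte in zip(positions, ttes) if tte_min <= tte <= tte_max]
    tracklets = []
    for start in range(len(kept) - length + 1):  # assumption: stride of one frame
        window = kept[start:start + length]
        tracklets.append((window[:n_obs], window[n_obs:]))
    return tracklets


# Toy sequence: one lateral position per frame, TTE counted from 20 down to -20.
ttes = list(range(20, -21, -1))
positions = [0.1 * k for k in range(len(ttes))]
tracklets_16 = extract_tracklets(positions, ttes, length=16)
tracklets_24 = extract_tracklets(positions, ttes, length=24)
```

With 31 frames inside the TTE range, a window of 16 positions yields 16 tracklets with 8 future steps each, and a window of 24 positions yields 8 tracklets with 16 future steps each.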


Table 4.8: Evaluation results for the crossing sequences of the Daimler context path prediction
dataset [Koo14]. The final displacement error is shown for predicting 8, 12, and 16
steps into the future.


Crossing sequences
                                 8/8                 8/12                8/16
Approach                   FDE/m   σ_FDE/m     FDE/m   σ_FDE/m     FDE/m   σ_FDE/m
RNN-IMM                    0.0841  0.1219      0.1224  0.2076      0.1978  0.3174
IMM filter (CP, CV)        0.1490  0.1387      0.2522  0.2405      0.3456  0.35197
IMM filter (CP, CV, CA)    0.1601  0.1640      0.3160  0.2798      0.4202  0.3946
IMM filter (CV, CA)        0.1492  0.1435      0.3217  0.2809      0.4450  0.4201
Kalman filter (CA)         0.1906  0.16623     0.5764  0.4090      0.8770  0.6400
Kalman filter (CV)         0.1491  0.1387      0.2523  0.2406      0.3460  0.3520
Linear interpolation       0.1526  0.1442      0.2545  0.2427      0.3490  0.3553


Table 4.9: Evaluation results for the stopping sequences of the Daimler context path prediction
dataset [Koo14]. The final displacement error is shown for predicting 8, 12, and 16
steps into the future.


Stopping sequences
                                 8/8                 8/12                8/16
Approach                   FDE/m   σ_FDE/m     FDE/m   σ_FDE/m     FDE/m   σ_FDE/m
RNN-IMM                    0.0693  0.0934      0.1316  0.2075      0.2766  0.3755
IMM filter (CP, CV)        0.1890  0.1627      0.3588  0.2742      0.5207  0.4027
IMM filter (CP, CV, CA)    0.1581  0.1450      0.2886  0.2542      0.4404  0.384
IMM filter (CV, CA)        0.1797  0.1602      0.3336  0.2715      0.5567  0.3959
Kalman filter (CA)         0.1799  0.1784      0.3376  0.3254      0.5507  0.5114
Kalman filter (CV)         0.1897  0.1628      0.3592  0.2745      0.5572  0.3963
Linear interpolation       0.1990  0.1685      0.3627  0.2768      0.5619  0.3993


As reference models, the IMM and Kalman filters with the corresponding
dynamical models as described in section 4.2.2.3 are used. On the real-world
sequences, too, the RNN-IMM achieves the best performance. In the comparison
of the different model set configurations, the difficulty of choosing the best
configuration becomes clearly visible. Although the performance difference is
relatively small, including a CP or CA model helps to better capture the
stopping behavior. As expected, in pure crossing sequences a Kalman filter
with a single CV model performs well. Because the pedestrians walk with a
relatively constant speed, the second-order CA model interprets noisy
observations as additional acceleration and thus performs worse than the
first-order CV model. The IMM filters are able to handle this situation better.
Due to the larger observation noise in the real-world sequences, distinguishing
between the maneuver situations and finding the best dynamical model
combination is more difficult. As shown, the number of different proposed
dynamical model structures, in combination with the general problem of finding
a suitable physics-based model, shows the benefit of capturing switching
dynamics with RNN-based solutions. In figure 4.21 and figure 4.22, the density
distributions predicted with an RNN-IMM are visualized for a crossing and a
stopping maneuver, respectively, for 8 future steps. The lengths of the colored
bars above the pedestrians depict the model probabilities. For the
visualization, the predicted distribution is mapped back to image space using
the calibration information. The future distributions are obtained by sampling
from the predicted distribution. In the crossing situation, the model classifies
the straight walking behavior correctly and predicts crossing with a high
probability. In figure 4.22, the switching behavior is highlighted. In the
images on the left, the deceleration begins but straight walking is still
dominant. In the images in the middle, the selected model has changed and
stopping behavior is classified. The predicted positions are closer to the
current positions. On the right, the pedestrian stands at the curbside and the
RNN-IMM recognizes the situation correctly. Thus, the predicted density is very
close to the current position. Although the observation noise level is larger
than for the synthetic scenarios, the RNN-IMM tends to be overconfident towards
one mode. This effect is further analyzed in the next section for simulated
trajectories under better-controlled conditions.


4.2.2.6 Scenario: Intention Classification


Table 4.10: Intention classification results for a comparison between an IMM filter and the
proposed RNN-IMM for generated pedestrian trajectories with a Markovian switch
between a CV and CA model for increasing noise levels. For each noise level, the first
column shows the sensitivity and the second column the average predicted mode
probability of correctly classified dynamical modes.


                       sensitivity / average true mode probability
Approach               σ_o = 0.001    σ_o = 0.005    σ_o = 0.01      σ_o = 0.05     σ_o = 0.1
RNN-IMM-Mode           0.946  0.963   0.818  0.839   0.7606  0.787   0.629  0.669   0.587  0.633
IMM filter (CV, CA)    0.893  0.835   0.767  0.745   0.703   0.613   0.619  0.589   0.583  0.554


For the experiments so far, pedestrian intention prediction has been considered
jointly with path prediction. In this section, it is considered as a pure
classification task. In order to better control the conditions for training and
the test scenario, synthetic data is used again. For the real-world sequences
as well as for the simulated scenarios considered so far, the pedestrian
trajectories are treated as purely deterministic trajectories consisting of
fixed-length non-maneuvering and maneuvering motion. Although this is common
and reasonable, the underlying assumption of hybrid state systems such as the
IMM filter is that the true dynamical mode is modeled as a Markov chain with
discrete dynamical modes. In this section, a pedestrian tracking scenario with
switching dynamics is therefore simulated as a probabilistic trajectory with
random Markovian transitions. The pedestrian motion scenario oriented on the
statistics of the Daimler dataset is kept. Both for the reference IMM filter and
for generating sample trajectories, the combination of a CV model and a CA
model restricted to lateral motion is used. In the experiments, the observation
noise is increased step-wise, and the classification accuracy of the RNN-IMM is
compared to its IMM filter counterpart. For the simulation, the dynamics of a
pedestrian agent can switch between the two models with a switching probability
of $P(m_j^k \mid m_i^{k-1}) = 0.1$ ($j \neq i$) or stay in its dynamical mode with
$P(m_i^k \mid m_i^{k-1}) = 0.9$. The acceleration during a CA phase is sampled
from a Gaussian distribution
($\mathcal{N}(1\,\mathrm{m/s^2}, (0.1\,\mathrm{m/s^2})^2)$). Drawn decelerations
are rejected. The starting velocity is again sampled from a distribution of
common walking speeds in accordance with Teknomo [Tek02]. The velocity of a
simulated agent is allowed to exceed the physically possible maximum velocity of
pedestrians in order to prevent an undesired switch to the CV model due to a
constant maximum value. Thereby, the conditions for the IMM filter with the two
corresponding dynamic models are idealized. The process noise is modeled with
the two noise models explained in section 4.2.2.3. For comparison, the true
positive dynamics classification rate, also referred to as sensitivity, is used.
The predicted model probability for the current time step is compared with
the known ground truth mode. The first 8 steps are excluded for filter
initialization. In table 4.10, the results for an increasing observation noise
are summarized. The frame rate is set to 16 fps and the process noise parameters
to $\sigma_{\mathrm{CV}} = 0.01$ m/s² and $\sigma_{\mathrm{CA}} = 0.01$ m/s³. For
training and evaluation, 1000 trajectories of 4 seconds with random Markovian
transitions between a CV model and a CA model are generated. The evaluation is
done on 20% of the generated samples. Alongside the sensitivity, the average
mode probability for correctly classified dynamics is shown. Although the
conditions are suited for an IMM filter, the proposed RNN-IMM can better
classify the true dynamical mode from the noisy observations. For increasing
observation noise levels, the sensitivity decreases to less than 60% correctly
classified dynamical modes for both models. The results show that the RNN-IMM
can switch faster to other dynamics. However, distinguishing between the
selected dynamical models becomes more difficult for larger noise levels. The
IMM filter assigns each model a similar probability, and thus both models are
equally used to estimate the dynamical state. The predicted mode probability of
the IMM filter fits better to a sensitivity close to 0.5. Hence, the RNN-IMM
tends to be overconfident in the assigned dynamical model probabilities. The
model probabilities over time for both models are visualized in figure 4.23.
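
The Markovian mode switching and the two reported measures can be sketched in a few lines. This is a simplified illustration, not the simulation code behind table 4.10: the function names, the choice to start every trajectory in the CV mode, and the constant classifier output are assumptions; the switching probability of 0.1, the 8 excluded initialization steps, and the 64 steps (4 seconds at 16 fps) follow the text.

```python
import random


def sample_mode_sequence(n_steps, p_switch=0.1, rng=random):
    """Sample a two-mode Markov chain that switches with probability p_switch."""
    modes = ["CV"]  # assumption: every trajectory starts in the CV mode
    for _ in range(n_steps - 1):
        current = modes[-1]
        if rng.random() < p_switch:
            current = "CA" if current == "CV" else "CV"
        modes.append(current)
    return modes


def sensitivity(mode_probs, true_modes, n_init=8):
    """Return (sensitivity, average probability of correctly classified modes)."""
    correct = []
    pairs = list(zip(mode_probs, true_modes))[n_init:]  # skip filter initialization
    for probs, truth in pairs:
        predicted = max(probs, key=probs.get)
        if predicted == truth:
            correct.append(probs[truth])
    rate = len(correct) / len(pairs)
    avg_prob = sum(correct) / len(correct) if correct else 0.0
    return rate, avg_prob


rng = random.Random(0)
true_modes = sample_mode_sequence(64, rng=rng)  # 4 s at 16 fps
# Hypothetical classifier output that always favours the CV mode.
mode_probs = [{"CV": 0.7, "CA": 0.3} for _ in true_modes]
rate, avg_prob = sensitivity(mode_probs, true_modes)
```

In this toy setup, `rate` equals the fraction of CV steps after initialization, which makes the difference between the sensitivity and the average true mode probability easy to see.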


The true dynamics for a particular time step are indicated by the background
color. The CA model is visualized in dark blue and the CV model in dark yellow.
The IMM filter probabilities $P(\cdot)_{\mathrm{IMM}}$ are shown as a dotted
line and the RNN-IMM probabilities $P(\cdot)_{\mathrm{RNN}}$ as a solid line. The
example results demonstrate that the IMM filter combines the selected dynamical
models more strongly to describe the dynamical behavior of the object.
The RNN-based solution, on the other hand, makes a harder decision towards
one model.


4.2.2.7 Summary 


The presented results demonstrate the ability of the proposed RNN-based
IMM surrogate to deal with the switching motion behavior of a tracked object.
For the exemplary task of intention prediction in the context of intelligent
vehicles, the RNN-IMM and its IMM filter counterpart are realized as a top-down
component of a visual tracking system. By using a stereo camera system,
applying semi-global matching [Hir08], and correcting the ego-motion, the
detections are mapped to a physical reference system. Thus, the IMM filter can
rely on prior knowledge of the dynamics of pedestrian motion provided by
several proposed physics-based dynamical models or rather model sets. Under
these conditions, the RNN-IMM is able to recognize the change in motion type
faster and achieves a better performance, both for jointly estimating the future
path and for a pure classification task. The model capabilities were shown
on synthetic data and real-world data, both reflecting typical pedestrian
maneuvers. For comparison, both approaches describe the motion in terms of
pedestrian point kinematics. As a basic principle, pattern-based methods are
better suited to integrate more contextual cues. Hence, provided that sufficient
training data is available, and given the fast-growing body of datasets for
intelligent vehicles, the RNN-IMM offers much more potential for extension (see
section 2.3). Compared to the traditional IMM filter, the amount of engineering
is reduced since the transition probabilities are not explicitly modeled.
Another reason for the reduced engineering effort is that, for the RNN-IMM, a
maneuver is simply modeled as a deviation from standard straight walking
behavior without modeling dynamical model combinations.


Figure 4.23: Visualization of the mode probabilities of an IMM filter and an RNN-IMM model
for simulated trajectories with Markovian transition behavior.


4.2.3 Tracklet Alignment with a Minimum Variance Prototype


So far, we discussed the effect of hard-coded normalization to reduce the
variation in translation and rotation of pooled tracklet data in order to enable
better generalization across datasets. Since the object motion cues are
independent of these transformations, the amount of required modeling of object
motion states can be reduced. Nevertheless, when these normalizations are
applied to tracklet data, the reference point and the reference rotation are
chosen arbitrarily. Hence, the variation is eliminated from the references and
merely shifted along the tracklet. This makes clustering of such tracklets very
challenging. In addition, rotation normalization is sensitive to
out-of-distribution input tracklets. When just two observations are used for
estimating the rotation angle, the error in rotation depends on the distance
between the two observations. Thus, using the first two observations can lead
to a high rotation error. Using observations which lie further apart relies on
the assumption that the dynamics do not change between the two observations.
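
The dependence of the rotation error on the spacing of the two observations can be made concrete with a small numerical example. The sketch below is purely illustrative: the function name and the chosen distances are assumptions, and the lateral error of 0.06 m is only borrowed as a plausible magnitude.

```python
import math


def heading_error_deg(distance, lateral_error):
    """Heading error when the second of two observations is displaced sideways."""
    true_heading = 0.0  # both noise-free points lie on the x-axis
    first = (0.0, 0.0)
    second = (distance, lateral_error)  # noisy second observation
    estimated = math.atan2(second[1] - first[1], second[0] - first[0])
    return math.degrees(abs(estimated - true_heading))


# Illustrative values only: the same 0.06 m lateral error at two spacings.
error_close = heading_error_deg(distance=0.1, lateral_error=0.06)
error_far = heading_error_deg(distance=1.0, lateral_error=0.06)
```

For the closely spaced pair, the same lateral error translates into a heading error that is roughly an order of magnitude larger than for the widely spaced pair.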


In this section, a neural network solution that learns to align the input
tracklets is proposed. The alignment network learns the required transformation
of the input tracklets to achieve an optimal matching with an adjustable
prototype. Instead of an arbitrarily chosen reference point and a reference
rotation, a reference tracklet - the prototype - that reflects the minimum
variance in the input data is learned. The alignment network makes it possible
to assess the training data by analyzing the prototype. Furthermore, the
distance to the learned prototype can be used for clustering or identifying
out-of-distribution tracklets.
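
Using the distance to the prototype as an outlier score can be sketched as follows. The example assumes that the tracklets have already been aligned and have the same length as the prototype; the function names and the toy tracklets are hypothetical.

```python
import math


def distance_to_prototype(tracklet, prototype):
    """L2 distance between an aligned tracklet and the prototype."""
    return math.sqrt(sum((x - px) ** 2 + (y - py) ** 2
                         for (x, y), (px, py) in zip(tracklet, prototype)))


def rank_by_distance(tracklets, prototype):
    """Indices of the tracklets sorted from closest to farthest, plus distances."""
    distances = [distance_to_prototype(t, prototype) for t in tracklets]
    order = sorted(range(len(tracklets)), key=distances.__getitem__)
    return order, distances


# Hypothetical aligned tracklets: a constant-velocity prototype, one walking
# pedestrian close to it, and one standing pedestrian.
prototype = [(float(k), 0.0) for k in range(8)]
walking = [(k + 0.05, 0.02) for k in range(8)]
standing = [(0.0, 0.0)] * 8
order, distances = rank_by_distance([standing, walking], prototype)
```

The standing pedestrian ends up last in the ranking, that is, farthest from the prototype, and would be the first candidate for an out-of-distribution tracklet.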


The analysis of the alignment network is done on synthetically generated
trajectories reflecting different dynamical behaviors. In addition, the path
prediction data from the BIWI sequences [Pel09] and the UCY sequences [Ler07]
is used.


4.2.3.1 Alignment Network


The alignment network can be combined with a prediction network by framing
the prediction network with two transformation networks (forward and
backward network). Besides the transformation networks, the central block
for aligning the tracklets is the prototype network. In the following, it is
referred to as the bottleneck network. The bottleneck network consists only of a
freely adjustable prototype tracklet $z_{\mathrm{proto}}$ without additional
learnable parameters. This prototype tracklet is a sequence of randomly chosen
points $\{(x^k, y^k) \in \mathbb{R}^2 \mid k = 1, \ldots, K_{\mathrm{proto}}\}$ of
$K_{\mathrm{proto}}$ time steps. The bottleneck network can be defined as follows:

$$z_{\mathrm{proto}} = \mathrm{Bott}(\Theta_{\mathrm{Bott}}). \qquad (4.35)$$


Here, $\Theta_{\mathrm{Bott}}$ are the adjustable prototype points. The idea is
to find the best possible alignment with the prototype tracklet by
simultaneously adjusting the prototype and learning the transformation
parameters for the input tracklets. The resulting prototype represents a
tracklet with minimum variance. In other words, it reflects the dominant input
tracklet structure. In order to align the input tracklets with the prototype,
the tracklets are transformed by transformation networks. The forward network
(FW: $z \to z^{*}$) transforms the input trajectory to another trajectory space
conditioned on the input tracklet. Thereby, the affine transformations of
translation, rotation, and scaling are applied successively. Since the object
motion cues are independent of these transformations, the estimation of the
object motion state can be performed in the transformed trajectory space. The
forward network can be defined as follows:

$$
\begin{aligned}
t_{z,\mathrm{FW}},\, r_{z,\mathrm{FW}},\, s_{z,\mathrm{FW}} &= \mathrm{MLP}(z^{1:K_{\mathrm{obs}}}; \Theta_{\mathrm{FW}}), \\
z^{*,1:K_{\mathrm{obs}}} &= \mathrm{FW}(z^{1:K_{\mathrm{obs}}}, t_{z,\mathrm{FW}}, r_{z,\mathrm{FW}}, s_{z,\mathrm{FW}}).
\end{aligned}
\qquad (4.36)
$$
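
The successive application of translation, rotation, and scaling can be sketched as follows. This is a minimal illustration and not the network itself: in the proposed approach the parameters are predicted by an MLP from the input tracklet, whereas here they are hand-picked; the function name and the convention that the translation is added before rotating and scaling are assumptions.

```python
import math


def forward_transform(tracklet, t, r, s):
    """Apply translation t, rotation r (radians), and scale s in this order."""
    cos_r, sin_r = math.cos(r), math.sin(r)
    transformed = []
    for x, y in tracklet:
        x, y = x + t[0], y + t[1]                              # translation
        x, y = cos_r * x - sin_r * y, sin_r * x + cos_r * y    # rotation
        transformed.append((s * x, s * y))                     # scaling
    return transformed


# Hypothetical tracklet moving diagonally with a step length of 2*sqrt(2).
tracklet = [(3.0 + 2.0 * k, 1.0 + 2.0 * k) for k in range(4)]
# Hand-picked parameters that a trained MLP might output for this input.
aligned = forward_transform(tracklet, t=(-3.0, -1.0), r=-math.pi / 4,
                            s=1.0 / (2.0 * math.sqrt(2.0)))
```

With these parameters the tracklet is mapped onto unit steps along the x-axis starting at the origin, which is what a constant-velocity prototype would look like.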


The terms $t_{z,\mathrm{FW}}$, $r_{z,\mathrm{FW}}$, $s_{z,\mathrm{FW}}$ are the
learned translation, rotation, and scale for a given input tracklet
$z^{1:K_{\mathrm{obs}}}$ of a fixed observation window of $K_{\mathrm{obs}}$ time
steps. The parameters of the forward network and the prototype tracklet are
learned by minimizing the mean squared error between the transformed input
tracklets and the prototype. The loss for a set of $N$ input tracklets can be
defined as

$$
\mathcal{L}_{\mathrm{Bott}}(\Theta_{\mathrm{Bott}}, \Theta_{\mathrm{FW}}) =
\frac{1}{N} \sum_{i=1}^{N} \left\| z_i^{*,1:K_{\mathrm{obs}}} - z_{\mathrm{proto}}^{1:K_{\mathrm{proto}}} \right\|^2.
\qquad (4.37)
$$
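
The link between this loss and the notion of a minimum variance prototype can be illustrated numerically. The sketch below assumes tracklets that are already transformed and have the same length as the prototype, and it sums the squared point distances per tracklet before averaging over the tracklets; both the normalization and the function names are assumptions. For fixed transformations, the point-wise mean of the transformed tracklets minimizes the loss.

```python
def alignment_loss(transformed_tracklets, prototype):
    """Mean over tracklets of the summed squared point distances to the prototype."""
    total = 0.0
    for tracklet in transformed_tracklets:
        total += sum((x - px) ** 2 + (y - py) ** 2
                     for (x, y), (px, py) in zip(tracklet, prototype))
    return total / len(transformed_tracklets)


def mean_tracklet(transformed_tracklets):
    """Point-wise mean, the loss minimizer when the transformations are fixed."""
    n = len(transformed_tracklets)
    return [(sum(t[k][0] for t in transformed_tracklets) / n,
             sum(t[k][1] for t in transformed_tracklets) / n)
            for k in range(len(transformed_tracklets[0]))]


# Two hypothetical, already transformed tracklets of three steps each.
tracklets = [[(0.0, 0.1), (1.0, 0.1), (2.0, 0.1)],
             [(0.0, -0.1), (1.0, -0.1), (2.0, -0.1)]]
prototype = mean_tracklet(tracklets)
loss_at_mean = alignment_loss(tracklets, prototype)
loss_shifted = alignment_loss(tracklets, [(x, y + 0.05) for x, y in prototype])
```

Any shift of the prototype away from the point-wise mean increases the loss, so the remaining loss at the optimum measures the variance of the aligned tracklets around the prototype.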


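Alternatively, the Huber loss function [Hub64] or M-estimators [Ham11] can
be used. The aligned tracklets can then serve as input for the filtering or
prediction model. An RNN-based prediction model for the next
$K_{\mathrm{pred}}$ steps can be defined as

$$
\begin{aligned}
h_{\mathrm{enc}}^{K_{\mathrm{obs}}} &= \mathrm{RNN}(h_{\mathrm{enc}}^{K_{\mathrm{obs}}-1}, z^{*,K_{\mathrm{obs}}}; \Theta_{\mathrm{enc}}), \\
\hat{z}^{*,K_{\mathrm{obs}}+1:K_{\mathrm{obs}}+K_{\mathrm{pred}}} &= \mathrm{MLP}(h_{\mathrm{enc}}^{K_{\mathrm{obs}}}; \Theta_{\mathrm{pred}}).
\end{aligned}
\qquad (4.38)
$$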


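The estimated next steps in the transformed tracklet space are mapped back
to the original input space by the backward network (BW: $z^{*} \to z$). The
backward network performs the affine transformations in reverse order:

$$
\begin{aligned}
t_{z,\mathrm{BW}},\, r_{z,\mathrm{BW}},\, s_{z,\mathrm{BW}} &= \mathrm{MLP}(z^{1:K_{\mathrm{obs}}}; \Theta_{\mathrm{BW}}), \\
\hat{z}^{K_{\mathrm{obs}}+1:K_{\mathrm{obs}}+K_{\mathrm{pred}}} &= \mathrm{BW}(\hat{z}^{*,K_{\mathrm{obs}}+1:K_{\mathrm{obs}}+K_{\mathrm{pred}}}, t_{z,\mathrm{BW}}, r_{z,\mathrm{BW}}, s_{z,\mathrm{BW}}).
\end{aligned}
\qquad (4.39)
$$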
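
The special case in which the backward network exactly inverts the forward transformation can be sketched as a round trip. This is an illustration under assumptions: in the proposed approach the backward parameters are predicted by a separate MLP and may deviate from the exact inverse; the function names, the parameter convention, and the toy values are hypothetical.

```python
import math


def forward_transform(tracklet, t, r, s):
    """Translate by t, rotate by r (radians), then scale by s."""
    c, n = math.cos(r), math.sin(r)
    shifted = [(x + t[0], y + t[1]) for x, y in tracklet]
    return [(s * (c * x - n * y), s * (n * x + c * y)) for x, y in shifted]


def backward_transform(tracklet, t, r, s):
    """Undo forward_transform: inverse scale, inverse rotation, inverse translation."""
    c, n = math.cos(-r), math.sin(-r)
    unscaled = [(x / s, y / s) for x, y in tracklet]
    rotated = [(c * x - n * y, n * x + c * y) for x, y in unscaled]
    return [(x - t[0], y - t[1]) for x, y in rotated]


# Hypothetical future positions and transformation parameters.
future = [(4.0, 2.5), (4.5, 3.0), (5.0, 3.4)]
params = dict(t=(-4.0, -2.5), r=0.3, s=0.5)
restored = backward_transform(forward_transform(future, **params), **params)
round_trip_error = max(math.dist(a, b) for a, b in zip(future, restored))
```

Applying the three inverse operations in reverse order recovers the original positions up to floating-point error.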


Accordingly, the actual prediction is done in the transformed space, where
all input tracklets are ideally in alignment with the prototype in a
scene-independent reference system. Thus, large deviations from the prototype
tracklet can be used to identify out-of-distribution input tracklets, which
can lead to poor prediction results. Further, the prototype reflects the
minimum variation of the dynamical behavior and makes it possible to draw
conclusions about the dominating dynamics in the training samples. The observed
sequence can also be used as input for the backward network to incorporate
spatially dependent contextual cues into the overall network. Instead of only
applying the inverse transformations from the forward network, the prediction
can be further adapted. With this proposed cascade of successive transformation
and estimation steps, a distinction between the spatial context and the temporal
context is realized. Similarly to the sections discussed above, the prediction
or filtering network can be trained by minimizing the negative log-likelihood of
the ground truth positions under the predicted distribution or the mean squared
error between the ground truth positions and the predicted positions. A combined
network of an alignment network together with a prediction network is visualized
in figure 4.24.


Figure 4.24: Visualization of the alignment network with an integrated RNN-based prediction
network. The forward network transforms the input tracklet
$z_{\mathrm{obs}} = z^{1:K_{\mathrm{obs}}}$ to align with the prototype tracklet.
Prediction is performed in the transformed tracklet space. Then the tracklets are
transformed in reverse order to the original tracklet space,
$z_{\mathrm{pred}} = \hat{z}^{K_{\mathrm{obs}}+1:K_{\mathrm{obs}}+K_{\mathrm{pred}}}$,
by the backward network.


4.2.3.2 Evaluation 


The abilities of the alignment network are analyzed on synthetic data with 
one specific dynamical behavior. For real-world data, the path prediction se- 
quences from the BIWI sequences [Pel09] and the UCY sequences [Ler07] are 
chosen. 


The alignment models have been implemented using TensorFlow [Aba15] and
are trained for 20000 epochs using the Adam optimizer [Kin15] with an initial
learning rate of 0.0003. For the experiments, the forward network is realized
with two hidden layers of size 50.


4.2.3.3 Scenario: Synthetic Data


Figure 4.25: Visualization of aligned tracklets for the four different training sets. The prototype
tracklet is highlighted in red. The resulting prototype reflects the minimum
variation in the dynamical behavior. Tracklets are visualized in a unit-less embedded
space.


The synthetically generated trajectories reflect a CV, a CA, and a CT
dynamical behavior. Four training sets of 100 noise-free tracklets of 8 steps
with a frame rate of 1 fps are generated. All starting positions and heading
directions are uniformly sampled from an origin interval of [−10 m, 10 m] and
from a rotation interval of [−90.0°, 90.0°]. For the first set, the agent speed
is uniformly sampled from [5 m/s, 25 m/s] and is kept fixed. For the second set,
an additional acceleration is sampled from the interval [5 m/s², 25 m/s²]. Thus,
the second set includes only agents according to a CA model. The third and
the fourth training set include, in addition to the walking speed, a turning
component uniformly sampled from [5°, 30°] and [−15°, 15°], respectively. Thus,
the agents of the last two sets perform a curvilinear motion, where one set is
biased towards one turning direction. The resulting prototypes and the aligned
tracklets of the four training sets are visualized in figure 4.25.
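
A generator for such noise-free tracklets can be sketched as follows. The sampling intervals follow the text, while several details are assumptions: both coordinates of the starting position are drawn from [−10 m, 10 m], the turning component is interpreted as a heading change per step in degrees, and the function names are hypothetical.

```python
import math
import random


def generate_tracklet(start, heading_deg, speed, accel=0.0, turn_deg=0.0,
                      n_steps=8, dt=1.0):
    """Noise-free tracklet; accel changes the speed, turn_deg the heading per step."""
    x, y = start
    heading = math.radians(heading_deg)
    points = []
    for _ in range(n_steps):
        points.append((x, y))
        x += speed * dt * math.cos(heading)
        y += speed * dt * math.sin(heading)
        speed += accel * dt
        heading += math.radians(turn_deg)
    return points


def sample_cv_set(n_tracklets=100, rng=random):
    """First training set: constant speed, random start position and heading."""
    return [generate_tracklet(start=(rng.uniform(-10, 10), rng.uniform(-10, 10)),
                              heading_deg=rng.uniform(-90, 90),
                              speed=rng.uniform(5, 25))
            for _ in range(n_tracklets)]


cv_set = sample_cv_set(rng=random.Random(0))
steps = [math.dist(p, q) for p, q in zip(cv_set[0], cv_set[0][1:])]
```

The CA and CT sets would be obtained analogously by passing a sampled `accel` or `turn_deg` to `generate_tracklet`; for the CV set, all step lengths within a tracklet are identical.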


The images depict how the learned prototype reflects the underlying dynamics.
For the top left image, the prototype is shaped like a straight line with
equidistant points according to a CV model. The top right image shows a
prototype that reflects the additional acceleration. The images on the bottom
correspond to curvilinear motion. For the left image, the bias towards
one rotation direction can be seen. For the right image, the prototype is
adjusted to reflect the minimum variance of the input tracklets. For all
training sets, the variation in translation and rotation is removed.
Theoretically, the alignment results are independent of data pre-processing such
as position normalization. Nevertheless, data pre-processing helps to make
training more stable. The important difference is that the reference for all
tracklets is not arbitrarily chosen but learned. This makes it possible to
assess the input tracklets by their distance to a common reference - the
prototype of the bottleneck network.


4.2.3.4 Scenario: Real-World Data 


As part of the TrajNet dataset collection, the BIWI [Pel09] dataset and the 
UCY [Ler07] dataset are described in section 4.2.1. Although the datasets in- 
clude pedestrians with varying motion types, most pedestrians walk straight 
corresponding to a CV model (see section 4.2.1.1). 


Here, the datasets are used to analyze if the alignment network is able to learn 
a reasonable prototype from real-world data and to assess the input tracklets. 
In figure 4.26, the alignment results for input tracklets of length 8 of the BIWI
ETH and the UCY ZARA1 datasets are visualized. The resulting prototype
reflects a nearly constant walking behavior. For both datasets, the variation
due to affine transformations is compensated by the alignment network.


Figure 4.26: Alignment results for the BIWI ETH and UCY ZARA1 sequences. The prototype
tracklet is highlighted in red. Tracklets are visualized in a unit-less embedded
space.


Visualizations of the aligned tracklets at different steps during training on the
BIWI ETH dataset are depicted in figure 4.27. The images show how the
parameters of the forward network and the bottleneck network are jointly adapted
to remove the translation and rotation variation. How the prototype is adjusted
over time is also clearly visible.


The distances of the input tracklets to the learned prototype for the BIWI ETH
and UCY ZARA1 sequences are visualized in figure 4.28. For visualization,
the input tracklets are color-coded in accordance with a sequential colormap
using the L2 distance to the prototype. The strongest outliers for both datasets
correspond to standing or loitering persons. This can be better exposed by
using the translation and rotation normalized input tracklets, as shown in
figure 4.29. Since the result for the normalized input data is similar to using
un-processed tracklets, this also demonstrates the robustness of the alignment
network in removing affine transformation variation from the input tracklets.


In figure 4.29, the larger distance to the prototype of person tracklets close
to the origin is visible. The largest distances correspond to persons walking
slowly with some sort of loitering behavior. This complies with the statements
given in section 4.2.1.1.
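
The distance-based assessment described above can be illustrated with a short
sketch. It assumes one common normalization convention (first point moved to
the origin, first step rotated onto the positive x-axis), which is not
specified in this section, and it uses a hand-written straight-line prototype
as a stand-in for the prototype learned by the bottleneck network. All function
names are hypothetical.

```python
import math


def normalize_tracklet(tracklet):
    """Move the first point to the origin and rotate the first step onto +x."""
    x0, y0 = tracklet[0]
    x1, y1 = tracklet[1]
    # Heading of the first step. For a standing person this is ill-defined;
    # math.atan2(0.0, 0.0) returns 0.0, so such tracklets are left unrotated
    # and stay clustered at the origin.
    angle = math.atan2(y1 - y0, x1 - x0)
    c, s = math.cos(-angle), math.sin(-angle)
    return [((x - x0) * c - (y - y0) * s, (x - x0) * s + (y - y0) * c)
            for x, y in tracklet]


def l2_distance(tracklet, prototype):
    """L2 distance between two tracklets with the same number of points."""
    if len(tracklet) != len(prototype):
        raise ValueError("tracklet and prototype must have the same length")
    return math.sqrt(sum((x - px) ** 2 + (y - py) ** 2
                         for (x, y), (px, py) in zip(tracklet, prototype)))


def color_values(tracklets, prototype):
    """Scale distances to [0, 1] (0 to max) for use with a sequential colormap."""
    distances = [l2_distance(t, prototype) for t in tracklets]
    d_max = max(distances)
    if d_max == 0.0:
        return [0.0 for _ in distances]
    return [d / d_max for d in distances]


if __name__ == "__main__":
    # Stand-in prototype: constant-velocity walk along the x-axis (assumption).
    prototype = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    walking = normalize_tracklet([(1.0, 1.0), (1.0, 2.0), (1.0, 3.0)])
    standing = normalize_tracklet([(4.0, 4.0), (4.0, 4.0), (4.0, 4.0)])
    print(color_values([walking, standing], prototype))
```

In this toy setting the walking tracklet coincides with the prototype after
normalization, while the standing tracklet receives the largest color value,
which mirrors the observation that standing or loitering persons appear as the
strongest outliers.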


Figure 4.27: Visualization of aligned tracklets from the BIWI ETH dataset for different time
steps during training. The prototype tracklet is highlighted in red. The images
depict the joint learning of the tracklet transformations and the adjustment of
the prototype. Tracklets are visualized in a unit-less embedded space.

Figure 4.28: Color-coded input tracklets from the BIWI ETH and UCY ZARA1 datasets. The
color-coding is based on the L2 distance (from 0 to max) between a tracklet
and the learned prototype.


4.2.3.5 Summary 


The presented results show that the proposed alignment network offers new
possibilities to assess input tracklet data. Because trajectory clustering
approaches mainly rely on time-varying positions, applying normalization
removes most of the information they require. The prototype provides a
reference without shifting the variation along the tracklets. Thereby, the
conditions for applying clustering approaches or out-of-distribution
detection are improved. Moreover, the prototype is adjusted to match
the minimum variance in the input tracklets. Thus, it reflects the prototypical
dynamical behavior. The proposed alignment network offers a promising
direction for future research to better separate the temporally dependent motion
cues from the spatially dependent environmental cues.




[Plot: BIWI ETH (normalized translation and rotation); horizontal axis: X in meters]

Figure 4.29: Visualization based on the L2 distance (from 0 to max) between normalized
tracklets (translation and rotation) from the BIWI ETH dataset and the
corresponding learned prototype.

This thesis addressed the problem of state estimation of maneuvering objects
as part of a visual detection-by-tracking system with a focus on applications
in the surveillance and intelligent vehicle domains. Object observations
provided by an appearance model, describing the object in image space, serve as
input for recursive Bayesian filters or, respectively, for the proposed
RNN-based alternatives.


After discussing the interconnected Bayesian and deep learning func- 
tional viewpoints on state estimation, the IMM filter, as the most common 
representative based on a Bayesian formulation for dealing with model mis- 
matches or maneuvering objects, was selected as our reference approach. For 
a model mismatch scenario of directly tracking in image space, this thesis 
contributes to an improved design of a basic IMM filter as a top-down filtering 
approach by introducing a state de-coupling and a re-coupling scheme. 


The benefit of the suggested de-coupling scheme of an IMM filter was
demonstrated for prototypical visual object tracking sequences, where the
estimation of the mapping function to a 3D physical reference system is so far
a largely unsolved problem. To better deal with the corresponding observation
uncertainties in these conditions, a state re-coupling scheme was introduced.
Thereby, an implicit depth prior, which is connected through the object scales,
enables a scale-dependent adaptation of the observation noise levels.


In order to reduce the amount of required engineering and to learn an
improved process model set structure, the IMM functionality was transferred
into a comparable deep learning architecture. Since the IMM filter is in
particular designed to deal with the maneuver types of switching noise levels
and switching dynamics, the proposed RNN-based networks were correspondingly
analyzed with regard to both maneuver types. This was done for the
exemplary tasks of path prediction and intention prediction. Because a public
standard benchmark exists, path prediction was mainly used for
comparison with related motion prediction approaches. The dataset
analysis revealed that the trajectory data reflects the desired properties
of observations provided by an underlying visual tracking component with
varying noise levels. Despite the simple core architecture of an RNN-encoder
with a dense layer for mapping back into the observation space, the proposed
RNN network achieved the top rank on the World H-H TrajNet challenge, and thus
a performance comparable to related current state-of-the-art methods.
The presented modifications, such as overshooting, helped to enhance the
prediction performance in the presence of varying noise levels.


The ability of the proposed solutions with respect to the switching dynamics of
objects was evaluated for intention prediction in the application domain of
intelligent vehicles. In extensive experiments on synthetic and real-world
datasets, the proposed RNN-based IMM filter surrogate (RNN-IMM) obtained
a performance boost over existing IMM filter configurations tailored
to specific maneuver scenarios. Similar to an IMM filter solution, the
presented RNN-IMM assigns a probability value to each dynamics model and, based
on these, outputs a multi-modal distribution over future object states. The
RNN-IMM achieved a better performance for jointly estimating intentions and
future paths. Also for a pure intention classification task, the RNN-IMM yields
a performance boost compared to IMM filters. The amount of filter tuning is
reduced due to a direct estimate of the dynamics probability, and thus there
is no explicit modeling of the transition probabilities. Although several
tailored IMM solutions exist for the analyzed maneuver scenarios, the RNN-IMM
captures the switch in dynamics more reliably.
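
The idea of assigning a probability to each dynamics model and combining
per-model predictions into a multi-modal output can be sketched as follows.
This is not the RNN-IMM architecture itself: the use of a softmax over raw
scores, the representation of each mode by a single mean position, and all
function names are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Turn raw per-mode scores into probabilities that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_mean(weights, means):
    """Probability-weighted mean of the per-mode predicted positions."""
    dim = len(means[0])
    return tuple(sum(w * mu[i] for w, mu in zip(weights, means))
                 for i in range(dim))


def most_likely_mode(weights):
    """Index of the dynamics mode with the highest probability."""
    return max(range(len(weights)), key=lambda k: weights[k])


if __name__ == "__main__":
    # Hypothetical scores for three dynamics modes, e.g. straight, left, right.
    weights = softmax([0.1, 2.0, -1.0])
    # Hypothetical per-mode predicted positions (x, y) in meters.
    means = [(5.0, 0.0), (4.0, 2.0), (4.0, -2.0)]
    print(weights, mixture_mean(weights, means), most_likely_mode(weights))
```

The most likely mode can serve as an intention estimate, while the weighted
set of per-mode predictions forms the multi-modal distribution over future
positions. No explicit mode transition matrix appears in this sketch, which
reflects the remark above that the dynamics probability is estimated directly.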


Instead of utilizing the state estimator as a top-down module in a visual
tracking system, one direction for future research is the end-to-end reasoning
on object motion directly from image sequences. Here, relying on an
intermediate object state representation was a design choice to allow, among
other things, a fairer comparison between Bayesian filters and RNN-based
alternatives. However, more and more end-to-end formulations of different
tracking tasks are being introduced. Although they currently achieve lower
performance on benchmarks [Lea17], end-to-end solutions are a future direction
to overcome requirements on system identification in the form of dynamics or
observation models, and on any feature engineering for the appearance model.


As discussed in section 2.3, developing sophisticated methods for motion
prediction which go beyond Kalman filtering is a clear trend. Due to the rapidly
expanding field and thus the number of new and diverse methods, there is a need
for improved standardized prediction benchmarking. Especially for
benchmarking prediction with contextual cues, a well-defined categorization of
the underlying data into specific conditions is crucial. The presented results
revealed that it is necessary to first ensure a meaningful learned
representation for single object dynamics rather than just to increase the
model complexity by simply adding cues. Nevertheless, more contextual cues are
undeniably necessary in order to improve object behavior anticipation. More
methods, in particular deep learning-based methods, start to include the global
structure of the environment and allow better estimates of context-dependent
patterns in real-world data. Thus, intelligent autonomous systems require an
in-depth semantic scene understanding to predict object motion or to plan and
navigate alongside them [Rud20]. Contextual understanding with respect to
features of the static environment and dynamic environment offers many options
for future research to explore. In order to better separate spatially
independent motion cues from the spatially dependent environmental cues, the
proposed alignment network offers a promising direction for future research.